Extracted line images with wrong vertical offset #16

Closed
stweil opened this issue May 23, 2020 · 10 comments

@stweil
Contributor

stweil commented May 23, 2020

Here is an example of line image and matching text, both extracted with page2img.py:

Donnerstag und Samstag wird das Blatt künftig

sample line image

Obviously there is a vertical offset: the text belongs to the next line, so the wrong image was extracted. All other line images show a similar vertical offset. The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:

[...]
<Page imageFilename="0111_nzz_18901222_0_0_a1_p1_1.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="117,1676 975,1676 975,1941 117,1941"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="122,1689 976,1689 976,1741 122,1741"/>
            <Baseline points="121,1762 976,1764"/>
            [...]

The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

ABBYY produced this PAGE XML which contains good coordinates for the text line:

[...]
<Page imageFilename="1200024.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="0,0 3838,0 3838,5551 0,5551"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="112,1678 970,1678 970,1943 112,1943"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="115,1723 969,1723 969,1775 115,1775"/>
            <Baseline points="115,1798 969,1798"/>
            [...]
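
The two excerpts can be compared numerically. The following is only a sketch: it uses the `points` values quoted above, and the helper names `parse_points`, `top_edge` and `mean_y` are made up for this illustration, not taken from page2img.py.

```python
def parse_points(points):
    """Turn a PAGE 'points' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]

def top_edge(points):
    """Smallest y value of the polygon, i.e. its upper edge."""
    return min(y for _, y in parse_points(points))

def mean_y(points):
    """Average y value, used here as a rough baseline height."""
    ys = [y for _, y in parse_points(points)]
    return sum(ys) / len(ys)

# TextLine tl_32 as quoted from the Transkribus and ABBYY files above.
transkribus_line = "122,1689 976,1689 976,1741 122,1741"
abbyy_line = "115,1723 969,1723 969,1775 115,1775"
transkribus_baseline = "121,1762 976,1764"
abbyy_baseline = "115,1798 969,1798"

line_shift = top_edge(abbyy_line) - top_edge(transkribus_line)
baseline_shift = mean_y(abbyy_baseline) - mean_y(transkribus_baseline)
print(line_shift, baseline_shift)  # 34 35.0
```

Both line boxes are 52 pixels high, so for this line the Transkribus coordinates sit roughly two thirds of a line height above the ABBYY ones.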
@stweil
Contributor Author

stweil commented May 23, 2020

The PRImA page viewer complains about the negative coordinates. It also shows that vertical offset, so it displays text which does not match the line under the mouse pointer for the above PAGE XML and its corresponding TIFF image. ocr-validate also reports an error:

$ ocr-validate page-2013-07-15 *1890*xml
mXSDFilename: /home/stweil/src/github/OCR-D/venv-20200408/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml
/home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml fails to validate because: 

cvc-pattern-valid: Value '4,-27 3842,-27 3842,5524 4,5524' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
At: 16:63
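
The same check can be reproduced outside the validator. This is a minimal sketch that assumes the pattern quoted in the error message is the complete `PointsType` restriction; the function name `is_valid_points` is hypothetical.

```python
import re

# Pattern copied from the validator message above (PAGE 2013-07-15 PointsType).
# XSD patterns must match the whole value, hence fullmatch().
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")

def is_valid_points(points):
    """Return True if the points string satisfies the quoted pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None

print(is_valid_points("4,-27 3842,-27 3842,5524 4,5524"))  # False (Transkribus PrintSpace)
print(is_valid_points("0,0 3838,0 3838,5551 0,5551"))      # True (ABBYY PrintSpace)
```

The minus sign is simply not part of the allowed character set, so any negative coordinate fails validation.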

@stweil
Contributor Author

stweil commented May 23, 2020

See new issue Transkribus/TranskribusCore#45.

@stweil
Contributor Author

stweil commented May 24, 2020

A closer look at nzz_18901222_0_0_a1_p1_1.xml with the PRImA page viewer shows that only some text regions with their text lines are affected by a vertical shift.

@bertsky

bertsky commented May 24, 2020

The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:

    <PrintSpace>
        <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
    </PrintSpace>

This is invalid by any interpretation: PAGE-XML syntax forbids negative coordinates. This must be fixed in Transkribus.

The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

There's no need to act on PrintSpace in any way for an image extractor. All PAGE-XML coordinates are absolute (i.e. they refer to imageFilename). Even on the page level, the only relevant element for cropping a bbox rectangle is Border.
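
To illustrate the point about absolute coordinates, here is a rough sketch of how a line bounding box can be read straight from the PAGE XML without looking at PrintSpace. It is not the code of page2img.py or ocrd_segment. The sample document, the file name `page.tif` and the function `line_boxes` are assumptions made for this example; the namespace is the one of the 2013-07-15 schema named in the validator output.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Minimal, hypothetical PAGE document built from the coordinates quoted above.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace><Coords points="0,0 3838,0 3838,5551 0,5551"/></PrintSpace>
    <TextRegion id="r_5_3">
      <Coords points="117,1676 975,1676 975,1941 117,1941"/>
      <TextLine id="tl_32">
        <Coords points="122,1689 976,1689 976,1741 122,1741"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def line_boxes(page_xml):
    """Map each TextLine id to (left, top, right, bottom) in image pixels."""
    root = ET.fromstring(page_xml)
    boxes = {}
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)  # direct child of the TextLine
        pts = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs, ys = zip(*pts)
        # PrintSpace is deliberately ignored: the points already refer to the image.
        boxes[line.get("id")] = (min(xs), min(ys), max(xs), max(ys))
    return boxes

print(line_boxes(SAMPLE))  # {'tl_32': (122, 1689, 976, 1741)}
```

The resulting tuple could be passed as a crop box to an image library; which library the actual tools use is not stated in this thread.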

In summary, I don't think this is a bug in either page2img or ocrd-segment-extract-*.

@stweil
Contributor Author

stweil commented May 24, 2020

Thank you. That confirms my latest impression. The Transkribus PAGE for Neue Zürcher Zeitung is at least partially a complete mess: word boxes lie outside their corresponding lines, and line boxes lie outside their regions. I see no chance to fix that programmatically and will now try to use the original coordinates which were generated by ABBYY FineReader.
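
A simple containment test makes such inconsistencies visible. The sketch below compares axis-aligned bounding boxes only, which is an approximation for arbitrary polygons, and the helper names `bbox` and `contains` are invented for this example.

```python
def bbox(points):
    """Bounding box (left, top, right, bottom) of a PAGE points string."""
    pts = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)

def contains(outer, inner):
    """True if the bounding box of 'inner' lies within that of 'outer'."""
    ox0, oy0, ox1, oy1 = bbox(outer)
    ix0, iy0, ix1, iy1 = bbox(inner)
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1

# Region r_5_3 and line tl_32 from the two files quoted earlier in this issue.
print(contains("117,1676 975,1676 975,1941 117,1941",
               "122,1689 976,1689 976,1741 122,1741"))  # False (Transkribus)
print(contains("112,1678 970,1678 970,1943 112,1943",
               "115,1723 969,1723 969,1775 115,1775"))  # True (ABBYY)
```

For this particular line the Transkribus box exceeds its region by a single pixel on the right, so the check only flags candidates; it does not say how large or how systematic the error is.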

@stweil
Contributor Author

stweil commented May 24, 2020

Closing this issue. I created Transkribus/TranskribusCore#46 to address those errors.

@stweil stweil closed this as completed May 24, 2020
@bertsky

bertsky commented May 24, 2020

and will now try to use the original coordinates which were generated by ABBYY FineReader.

IIRC @wrznr also uses a pipeline to convert ABBYY output in ALTO format to PAGE (reducing bbox overlap via clipping and resegmentation) but recently discovered a bug introduced by deskewing offset?

@simon-clematide

We also noticed negative offsets in PAGE XML exports from Transkribus (one can just set them to 0). If I remember correctly, we sometimes had problems running HTR (after running ABBYY for layout recognition) on some pages where line regions existed at the border of the page (presumably with negative coordinates).

@stweil
Contributor Author

stweil commented May 24, 2020

Thanks for your report. Setting the negative PrintSpace values to zero does indeed fix the invalid XML, so the data can be loaded in the viewer after that fix. It does not cure the wrong word and line boxes.
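
The workaround of setting negative values to zero could look like the following sketch. The function name `clamp_points` is hypothetical, and the code only rewrites a single points string; it does not touch the XML file itself.

```python
def clamp_points(points):
    """Replace negative x or y values in a PAGE points string with 0."""
    clamped = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)

print(clamp_points("4,-27 3842,-27 3842,5524 4,5524"))
# 4,0 3842,0 3842,5524 4,5524
```

This only makes the value schema-valid. The x value 3842 still exceeds the declared imageWidth of 3839, and the shifted word and line boxes remain unchanged.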

@bertsky

bertsky commented May 24, 2020

It does not cure the wrong word and line boxes.

Then the problem runs deeper. (There is at least one plausible and harmless reason for negative coordinates, and that's segmenting in a cropped and deskewed image, then converting back to absolute coordinates. The rotation will enlarge the image, introducing an offset, which has to be subtracted when converting the coordinates. But if the segments themselves have an apparent offset after conversion, then there's another problem.)
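
The harmless case can be sketched as follows. This is an illustration under stated assumptions, not the conversion used by any of the tools mentioned here: the image is assumed to be rotated about its center and enlarged to fit, the sign convention of the angle is arbitrary, and `rotated_size` and `to_original` are made-up names.

```python
import math

def rotated_size(w, h, angle_deg):
    """Size of the enlarged canvas that holds a w x h image rotated by angle_deg."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return w * c + h * s, w * s + h * c

def to_original(x, y, w, h, angle_deg):
    """Map a point from the rotated, enlarged image back to original coordinates."""
    rw, rh = rotated_size(w, h, angle_deg)
    # Express the point relative to the center of the enlarged image;
    # this is where the offset introduced by the enlargement is subtracted.
    dx, dy = x - rw / 2, y - rh / 2
    a = math.radians(angle_deg)
    # Undo the rotation, then shift to the center of the original image.
    ox = dx * math.cos(a) - dy * math.sin(a)
    oy = dx * math.sin(a) + dy * math.cos(a)
    return ox + w / 2, oy + h / 2

# A point in the top-left padding of a page deskewed by 1 degree.
x, y = to_original(0, 0, 3839, 5551, 1.0)
print(x, y)
```

With the image size from this issue, the corner of the enlarged canvas maps to a point slightly outside the original image, so one of its coordinates becomes negative even though nothing went wrong in the segmentation.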
