Extracted line images with wrong vertical offset #16

stweil · 2020-05-23T18:25:57Z

Here is an example of a line image and its matching text, both extracted with page2img.py:

Donnerstag und Samstag wird das Blatt künftig

Obviously there is a vertical offset: the text belongs to the next line, so the wrong image was extracted. All other line images show a similar vertical offset. The PAGE XML file was created by Transkribus, and it contains data which might be the cause of that:

[...]
<Page imageFilename="0111_nzz_18901222_0_0_a1_p1_1.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="117,1676 975,1676 975,1941 117,1941"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="122,1689 976,1689 976,1741 122,1741"/>
            <Baseline points="121,1762 976,1764"/>
            [...]

The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

ABBYY produced this PAGE XML which contains good coordinates for the text line:

[...]
<Page imageFilename="1200024.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="0,0 3838,0 3838,5551 0,5551"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="112,1678 970,1678 970,1943 112,1943"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="115,1723 969,1723 969,1775 115,1775"/>
            <Baseline points="115,1798 969,1798"/>
            [...]
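To put a number on the offset, here is a minimal sketch that compares the two `tl_32` line boxes quoted above. The helper names `parse_points` and `bbox` are hypothetical and not part of page2img.py; only the coordinate strings are taken from the two PAGE XML files.

```python
def parse_points(points):
    """Turn a PAGE 'points' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bbox(points):
    """Return (left, top, right, bottom) of the polygon given as a points string."""
    xs, ys = zip(*parse_points(points))
    return min(xs), min(ys), max(xs), max(ys)


# TextLine tl_32 as exported by Transkribus and by ABBYY (copied from above).
transkribus = bbox("122,1689 976,1689 976,1741 122,1741")
abbyy = bbox("115,1723 969,1723 969,1775 115,1775")

dx = transkribus[0] - abbyy[0]           # horizontal difference: 7
dy = transkribus[1] - abbyy[1]           # vertical difference: -34
line_height = abbyy[3] - abbyy[1]        # 52

print(f"dx={dx}, dy={dy}, line height={line_height}")
```

For this line the Transkribus box sits 34 pixels above the ABBYY box, with a line height of 52 pixels.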


stweil · 2020-05-23T19:14:38Z

The PRImA page viewer complains about the negative coordinates. ~~also shows that vertical offset, so displays texts which do not match the line under the mouse pointer for the above PAGE XML and its corresponding TIFF image.~~ ocr-validate also reports an error:

$ ocr-validate page-2013-07-15 *1890*xml
mXSDFilename: /home/stweil/src/github/OCR-D/venv-20200408/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml
/home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml fails to validate because: 

cvc-pattern-valid: Value '4,-27 3842,-27 3842,5524 4,5524' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
At: 16:63
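The same check can be reproduced without the XSD. The sketch below applies the `PointsType` pattern quoted in the error message to a points string; the function name `is_valid_points` is hypothetical, and a full schema validation does more than this single pattern test.

```python
import re

# Pattern copied from the validator message for type 'PointsType'.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def is_valid_points(points):
    """Return True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


transkribus_printspace = "4,-27 3842,-27 3842,5524 4,5524"
abbyy_printspace = "0,0 3838,0 3838,5551 0,5551"

print(is_valid_points(transkribus_printspace))
print(is_valid_points(abbyy_printspace))
```

The minus sign is not covered by `[0-9]+`, which is why the Transkribus PrintSpace fails the pattern while the ABBYY one passes.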

stweil · 2020-05-23T19:55:00Z

See new issue Transkribus/TranskribusCore#45.

stweil · 2020-05-24T07:32:44Z

A closer look at nzz_18901222_0_0_a1_p1_1.xml with the PRImA page viewer shows that only some text regions with their text lines are affected by a vertical shift.

bertsky · 2020-05-24T11:04:59Z

> The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:
>
>     <PrintSpace>
>         <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
>     </PrintSpace>

This is invalid by any interpretation: PAGE-XML syntax forbids negative coordinates. This must be fixed in Transkribus.

> The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

There's no need to act on PrintSpace in any way for an image extractor. All PAGE-XML coordinates are absolute (i.e. they refer to imageFilename). Even on the page level, the only relevant element for cropping a bbox rectangle is Border.
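A minimal sketch of what that means for an extractor: read each TextLine polygon, take its bounding box in image coordinates, and never look at PrintSpace. The embedded document is a hypothetical, heavily shortened PAGE file assembled from the ABBYY snippet in this thread; the namespace URI is assumed to be the 2013-07-15 one that ocr-validate was run against, and `line_boxes` is an illustrative name, not a function of page2img.py or ocrd_segment.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the PAGE 2013-07-15 schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical shortened sample built from the ABBYY snippet above.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="1200024.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
      <Coords points="0,0 3838,0 3838,5551 0,5551"/>
    </PrintSpace>
    <TextRegion type="paragraph" id="r_5_3">
      <Coords points="112,1678 970,1678 970,1943 112,1943"/>
      <TextLine id="tl_32">
        <Coords points="115,1723 969,1723 969,1775 115,1775"/>
        <Baseline points="115,1798 969,1798"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_boxes(xml_text):
    """Map TextLine id -> (left, top, right, bottom) in absolute image pixels."""
    root = ET.fromstring(xml_text)
    boxes = {}
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)  # direct child only, not the Baseline
        pts = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs, ys = zip(*pts)
        boxes[line.get("id")] = (min(xs), min(ys), max(xs), max(ys))
    return boxes


boxes = line_boxes(SAMPLE)
print(boxes)
```

The resulting tuple has the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be applied directly to the image named in `imageFilename`.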

In summary, I don't think this is a bug in either page2img or ocrd-segment-extract-*.

stweil · 2020-05-24T11:14:35Z

Thank you. That confirms my latest impression. The Transkribus PAGE for Neue Zürcher Zeitung is at least partially a complete mess: word boxes lie outside of the corresponding lines, and line boxes lie outside of their regions. I see no chance to fix that programmatically and will now try to use the original coordinates which were generated by ABBYY FineReader.

stweil · 2020-05-24T11:21:12Z

Closing this issue. I created Transkribus/TranskribusCore#46 to address those errors.

bertsky · 2020-05-24T11:33:21Z

> and will now try to use the original coordinates which were generated by ABBYY FineReader.

IIRC @wrznr also uses a pipeline to convert ABBYY output in ALTO format to PAGE (reducing bbox overlap via clipping and resegmentation) but recently discovered a bug introduced by deskewing offset?

simon-clematide · 2020-05-24T11:53:43Z

We also noticed negative offsets in PAGE XML exports from Transkribus (one can just set them to 0). If I remember correctly, we sometimes had problems running HTR (after running ABBYY for layout recognition) on pages where line regions existed at the border of the page (presumably with negative coordinates).

stweil · 2020-05-24T11:59:48Z

Thanks for your report. Setting the negative values for PrintSpace to zero does indeed fix the invalid XML, so the data can be loaded in the viewer after that fix. It does not cure the wrong word and line boxes.
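The workaround can be sketched in a few lines. The function name `clamp_points` is hypothetical; it only replaces negative values with zero and leaves everything else untouched, so it makes the points string valid but does not correct any misplaced box.

```python
def clamp_points(points):
    """Replace negative x or y values in a PAGE 'points' string with 0."""
    fixed = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        fixed.append(f"{max(0, x)},{max(0, y)}")
    return " ".join(fixed)


# PrintSpace coordinates from the Transkribus export quoted in this issue.
original = "4,-27 3842,-27 3842,5524 4,5524"
clamped = clamp_points(original)
print(clamped)
```

Note that the x value 3842 still exceeds the declared `imageWidth` of 3839; this sketch deliberately does not touch it, because only the negative values were reported as invalid.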

bertsky · 2020-05-24T13:34:21Z

> It does not cure the wrong word and line boxes.

Then the problem runs deeper. (There is at least one plausible and harmless reason for negative coordinates, and that's segmenting in a cropped and deskewed image, then converting back to absolute coordinates. The rotation will enlarge the image, introducing an offset, which has to be subtracted when converting the coordinates. But if the segments themselves have an apparent offset after conversion, then there's another problem.)
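A rough sketch of that harmless case, under stated assumptions: the image is rotated about its centre, the canvas is enlarged so nothing is cut off, and the sign convention of the angle is arbitrary. The function names and the 1 degree angle are made up for illustration; nothing here is known to be what Transkribus actually does.

```python
import math


def rotated_size(w, h, angle_deg):
    """Size of the enlarged canvas that holds a w x h image rotated by angle_deg."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return w * c + h * s, w * s + h * c


def to_original(x, y, w, h, angle_deg):
    """Map a point of the enlarged, rotated image back to original image coordinates."""
    big_w, big_h = rotated_size(w, h, angle_deg)
    a = math.radians(angle_deg)
    # Move the origin to the centre of the enlarged image (this removes the offset),
    dx, dy = x - big_w / 2, y - big_h / 2
    # undo the rotation, then move the origin back to the original top-left corner.
    ox = dx * math.cos(a) - dy * math.sin(a) + w / 2
    oy = dx * math.sin(a) + dy * math.cos(a) + h / 2
    return ox, oy


w, h, angle = 3839, 5551, 1.0  # page size from this issue, hypothetical skew angle
big_w, big_h = rotated_size(w, h, angle)
corner = to_original(0, 0, w, h, angle)
print(big_w > w, big_h > h, corner)
```

A point in the corner of the enlarged canvas lies outside the original page, so at least one of its converted coordinates is negative. A polygon that touches the margin of the deskewed image can therefore end up with negative values without any segment being shifted relative to its text.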

stweil mentioned this issue May 23, 2020

Fix transcriptions in evaluation set impresso/NZZ-black-letter-ground-truth#5

Merged

stweil mentioned this issue May 24, 2020

Inconsistent coordinates for words, lines and regions in Transkribus PAGE XML Transkribus/TranskribusCore#46

Open

stweil closed this as completed May 24, 2020
