Extracted line images with wrong vertical offset #16
The PRImA page viewer complains about the negative coordinates.
See new issue Transkribus/TranskribusCore#45.
A closer look at
This is invalid by any interpretation: PAGE-XML syntax forbids negative coordinates. This must be fixed in Transkribus.
There's no need to act on In summary, I don't think this is a bug in either page2img or ocrd-segment-extract-*.
Thank you. That confirms my latest impression. The Transkribus PAGE for Neue Zürcher Zeitung is, at least in part, a complete mess: word boxes lie outside their corresponding lines, and line boxes lie outside their regions. I see no chance of fixing that programmatically and will now try to use the original coordinates, which were generated by ABBYY FineReader.
Closing this issue. I created Transkribus/TranskribusCore#46 to address those errors.
IIRC, @wrznr also uses a pipeline to convert ABBYY output in ALTO format to PAGE (reducing bbox overlap via clipping and resegmentation), but recently discovered a bug introduced by the deskewing offset?
We also noticed negative offsets in PAGE XML exports from Transkribus (one can just set them to 0). If I remember correctly, we sometimes had problems running HTR (after running ABBYY for layout recognition) on pages that had line regions at the border of the page (presumably with negative coordinates).
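The "set them to 0" workaround could look roughly like the following. This is only a minimal sketch, assuming the coordinates are available as a PAGE-style points string of the form "x1,y1 x2,y2 ..."; the helper name clamp_points is hypothetical and is not part of page2img.py or ocrd_segment.

```python
def clamp_points(points: str) -> str:
    """Clamp negative x/y values in a PAGE-style points string to 0.

    Hypothetical helper for illustration; assumes integer coordinates
    separated by commas, with points separated by whitespace.
    """
    clamped = []
    for pair in points.split():
        x, y = (int(value) for value in pair.split(","))
        # Negative values are replaced by 0, everything else is kept.
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)
```

Clamping only removes the invalid values. It would not correct a systematic offset of the segments themselves.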
Thanks for your report. Setting the negative values for
Then the problem runs deeper. (There is at least one plausible and harmless reason for negative coordinates: segmenting in a cropped and deskewed image, then converting back to absolute coordinates. The rotation enlarges the image, introducing an offset, which has to be subtracted when converting the coordinates. But if the segments themselves have an apparent offset after conversion, then there's another problem.)
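To illustrate the harmless case, here is a rough sketch of how a point in an enlarged, deskewed image might be converted back to coordinates of the original image. It assumes rotation about the image center and a particular sign convention for the angle; the function names are hypothetical, and this is not the code used by page2img.py or ocrd_segment.

```python
import math


def rotated_size(width, height, angle_deg):
    """Size of the enlarged canvas that holds the whole rotated image."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return width * c + height * s, width * s + height * c


def deskewed_to_original(x, y, width, height, angle_deg):
    """Map a point from the enlarged, rotated image back to the original.

    Assumption: the image was rotated by angle_deg about its center and
    the canvas was expanded symmetrically around that center.
    """
    new_w, new_h = rotated_size(width, height, angle_deg)
    # Subtract the offset introduced by the enlarged canvas by working
    # relative to the center of the rotated image.
    dx, dy = x - new_w / 2, y - new_h / 2
    a = math.radians(angle_deg)
    # Undo the rotation (inverse rotation matrix).
    ox = dx * math.cos(a) + dy * math.sin(a)
    oy = -dx * math.sin(a) + dy * math.cos(a)
    # Shift back to the original image's coordinate origin.
    return ox + width / 2, oy + height / 2


# A point in the padded corner of the rotated canvas lands outside the
# original image, so its x coordinate becomes negative.
x, y = deskewed_to_original(0, 0, 1000, 1000, 5)
```

Under these assumptions, a segment that touches the padded border of the deskewed image can legitimately end up with a small negative coordinate, while a consistent shift of all segments would point to a different problem.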
Here is an example of a line image and its matching text, both extracted with page2img.py:
Donnerstag und Samstag wird das Blatt künftig
Obviously there is a vertical offset: the text belongs to the next line, so the wrong image was extracted. All other line images show a similar vertical offset. The PAGE XML file was created by Transkribus, and it contains data that might be the cause:
The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.
ABBYY produced this PAGE XML, which contains good coordinates for the text line: