This repository has been archived by the owner on Nov 16, 2020. It is now read-only.

Inconsistent coordinates for words, lines and regions in Transkribus PAGE XML #46

Open
stweil opened this issue May 24, 2020 · 1 comment

Comments

stweil commented May 24, 2020

The NZZ PAGE XML file was created by Transkribus.

Lines extracted with OCR-D tools show wrong vertical offsets (see OCR-D/format-converters#16). The PRImA page viewer shows that this PAGE file has many word boxes that don't cover their word and even lie outside the corresponding line. There are also lines outside of their text region. The Transkribus user interface should make such basic errors impossible.

jkloe (Contributor) commented May 27, 2020

Those effects may occur when transcriptions are edited on a line basis; word coordinates are then not automatically synced.
It is also true that there are no checks for whether lines or words stay within their parent shapes. Compared to the PRImA utils, Transkribus takes a more liberal approach here, mostly because the primary focus is on creating ground truth (GT) for HTR, which only takes the baselines and the corresponding text into account.
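
For readers who want to find affected elements in such a file themselves, here is a minimal sketch of the kind of parent/child check discussed above. It is not part of Transkribus, the PRImA utils or OCR-D. It assumes a PAGE version in which every `TextRegion`, `TextLine` and `Word` carries a direct `Coords` child with a `points="x,y x,y ..."` attribute, and it compares bounding boxes only, which is an approximation of true polygon containment. The names `find_escapes`, `bbox`, `inside` and the `tol` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Which child element type is checked against which parent type.
CHILD_OF = {"TextRegion": "TextLine", "TextLine": "Word"}


def local(tag):
    """Strip the XML namespace so the check works across PAGE versions."""
    return tag.rsplit("}", 1)[-1]


def bbox(elem):
    """Return (xmin, ymin, xmax, ymax) of the element's direct Coords child, or None."""
    for child in elem:
        if local(child.tag) == "Coords":
            pts = [tuple(float(v) for v in p.split(","))
                   for p in child.get("points", "").split()]
            if not pts:
                return None
            xs, ys = zip(*pts)
            return min(xs), min(ys), max(xs), max(ys)
    return None


def inside(inner, outer, tol=0):
    """True if the inner box lies within the outer box, allowing tol pixels of slack."""
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def find_escapes(root, tol=0):
    """List (child type, child id, parent id) for children outside their parent's box."""
    problems = []
    for parent in root.iter():
        child_name = CHILD_OF.get(local(parent.tag))
        if child_name is None:
            continue
        parent_box = bbox(parent)
        if parent_box is None:
            continue
        for child in parent:
            if local(child.tag) != child_name:
                continue
            child_box = bbox(child)
            if child_box is not None and not inside(child_box, parent_box, tol):
                problems.append((child_name, child.get("id"), parent.get("id")))
    return problems
```

A call such as `find_escapes(ET.parse("page.xml").getroot(), tol=5)` (the file name is a placeholder) would list every word that leaves its line and every line that leaves its region by more than 5 pixels. A bounding-box test can miss cases where a child stays within the parent's box but leaves its actual polygon, so a geometry library would be needed for an exact check.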
