Inconsistent coordinates for words, lines and regions in Transkribus PAGE XML #46

stweil · 2020-05-24T11:20:26Z

The NZZ PAGE XML file was created by Transkribus.

Lines extracted with OCR-D tools show wrong vertical offsets (see OCR-D/format-converters#16). The PRImA page viewer shows that this PAGE file has many word boxes that do not cover their word; some even lie outside the corresponding line. There are also lines outside their text region. The Transkribus user interface should make such basic errors impossible.

jkloe · 2020-05-27T15:44:41Z

Those effects may occur when transcriptions are edited on a line basis, because word coordinates are then not automatically synced.
It is also true that there are no checks whether lines or words lie within their parent shapes. Compared to the PRImA utils, Transkribus takes a more liberal approach here, mostly because its primary focus is on creating GT for HTR, which only takes the baselines and the corresponding text into account.
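
For anyone who wants to find the affected elements in a file, here is a minimal sketch of such a containment check. It is not part of Transkribus or the PRImA utils. It assumes the PAGE file stores outlines as `Coords` elements with a `points="x,y x,y ..."` attribute (older PAGE versions use `Point` child elements instead, which this sketch ignores), and it compares bounding boxes only, not the actual polygons. The function names and the `tol` parameter are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Which child type is checked against which parent type.
CHECKED = {"TextRegion": "TextLine", "TextLine": "Word"}


def local(tag):
    """Strip the XML namespace so the check works for any PAGE schema version."""
    return tag.rsplit("}", 1)[-1]


def bbox(elem):
    """Bounding box (x0, y0, x1, y1) of the element's direct Coords child, or None."""
    for child in elem:
        if local(child.tag) == "Coords":
            points = [
                tuple(float(v) for v in pair.split(","))
                for pair in child.get("points", "").split()
            ]
            if not points:
                return None
            xs, ys = zip(*points)
            return min(xs), min(ys), max(xs), max(ys)
    return None


def inside(inner, outer, tol=0):
    """True if the inner box lies within the outer box, allowing tol pixels of slack."""
    return (
        inner[0] >= outer[0] - tol
        and inner[1] >= outer[1] - tol
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def find_outliers(root, tol=0):
    """List (child type, child id, parent id) for children sticking out of their parent."""
    problems = []
    for parent in root.iter():
        child_name = CHECKED.get(local(parent.tag))
        if child_name is None:
            continue
        parent_box = bbox(parent)
        if parent_box is None:
            continue
        for child in parent:
            if local(child.tag) != child_name:
                continue
            child_box = bbox(child)
            if child_box is not None and not inside(child_box, parent_box, tol):
                problems.append((child_name, child.get("id"), parent.get("id")))
    return problems
```

It could be called as `find_outliers(ET.parse("page.xml").getroot())`, where `page.xml` is a placeholder file name. A bounding-box test is a coarse approximation: a word can pass it and still lie outside a non-rectangular line polygon, so a real polygon test would be stricter.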

This was referenced May 24, 2020

Extracted line images with wrong vertical offset OCR-D/format-converters#16 (Closed)

Fix transcriptions in evaluation set impresso/NZZ-black-letter-ground-truth#5 (Merged)
