lots of invalid files #5

Open
bertsky opened this issue Nov 26, 2021 · 4 comments


@bertsky

bertsky commented Nov 26, 2021

Going over the PAGE files with a linter against the actual schema from Transkribus (they hijacked the 2013 namespace), or the upstream 2019 schema (where it applies), yields different sources of error on various files:

  • <Software> Transkribus <Software>
  • abuse of /PcGts/Metadata/Comments for recursive elements (despite being a simpleType)
  • large negative @points (both x and y)
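
The third kind of error in the list above can be located without a schema validator. What follows is a minimal sketch, not the linter used here: it assumes every coordinate-bearing element (such as `Coords` or `Baseline`) stores its polygon in a `points` attribute as space-separated `x,y` pairs, and it ignores namespaces so that it works on both the 2013 and 2019 variants. The function name `negative_points` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one region with negative coordinates, one without.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageWidth="100" imageHeight="50">
    <TextRegion id="r1"><Coords points="-5,10 20,10 20,-3 -5,30"/></TextRegion>
    <TextRegion id="r2"><Coords points="1,1 9,1 9,9 1,9"/></TextRegion>
  </Page>
</PcGts>"""


def negative_points(page_xml):
    """Return (local tag name, points string) for each element whose
    points attribute contains at least one negative x or y value."""
    root = ET.fromstring(page_xml)
    hits = []
    for elem in root.iter():
        points = elem.get("points")
        if points is None:
            continue
        # "x1,y1 x2,y2 ..." -> flat list of numbers
        values = [float(v) for pair in points.split() for v in pair.split(",")]
        if any(v < 0 for v in values):
            local_name = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
            hits.append((local_name, points))
    return hits


for tag, points in negative_points(SAMPLE):
    print(tag, points)
```

This only reports offending elements; deciding how to repair them is a separate question.
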
@stweil
Member

stweil commented Nov 26, 2021

Note that most PAGE files were produced with Aletheia, which comes from the inventors of the PAGE format. If there is anything wrong with such PAGE files, you should report it there.

Some PAGE files were produced by Transkribus. It is a known problem that such PAGE files are "special". See for example Transkribus/TranskribusCore#45, now moved to GitLab.

@bertsky
Author

bertsky commented Nov 26, 2021

Yes, the 2019 files are all good, only the Transkribus ones are invalid. But it's no use looking back at the tool that generated them; this concerns the dataset alone. The first error is trivial to fix, and the second is a question of devising a good mapping scheme from the various descriptions in Comments to 2019 version MetadataItems. But the third is not so trivial: it may not be enough to just impose the page frame as a coordinate boundary. I have noticed there are clear quality problems with the polygons themselves (invalidities and frequent inconsistencies between lines and regions), which often make it impossible to extract text line images correctly. And then there is still a problem with precision: often, ascenders, descenders or diacritics are not included in the polygons of the handwriting.
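
To illustrate why imposing the page frame alone may not suffice, here is a minimal sketch under stated assumptions: coordinates are treated as pixel indices in `0..width-1` and `0..height-1`, and a polygon counts as degenerate when its shoelace area is zero. The helper names `clamp_points` and `polygon_area` are hypothetical and not taken from any existing tool.

```python
def clamp_points(points, width, height):
    """Clamp every (x, y) into the page frame [0, width-1] x [0, height-1]."""
    return [(min(max(x, 0), width - 1), min(max(y, 0), height - 1))
            for x, y in points]


def polygon_area(points):
    """Absolute polygon area via the shoelace formula."""
    total = 0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Partly outside the page: clamping gives a usable (but cropped) polygon.
partly_outside = [(-5, 10), (20, 10), (20, 30), (-5, 30)]
# Entirely outside the page: clamping collapses it to a single point.
fully_outside = [(-50, -40), (-10, -40), (-10, -5), (-50, -5)]

for polygon in (partly_outside, fully_outside):
    clamped = clamp_points(polygon, width=100, height=50)
    print(clamped, polygon_area(clamped))
```

Clamping therefore makes the coordinates non-negative, but it can silently turn a misplaced polygon into an empty one, so a validity check is still needed afterwards.
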

@stweil
Member

stweil commented Nov 26, 2021

Again, that's a known problem of Transkribus. Obviously Transkribus users don't care for correct boxes / polygons. Usually the baselines are better, but it looks like Transkribus does not fix the polygon dimensions when baselines are modified.

The same problems exist for all PAGE files of AustrianNewspapers (also produced by Transkribus).

And the bad news is that probably most transcriptions today are still done by people using Transkribus. Nevertheless, I see no progress or improvements on that side.

@bertsky
Author

bertsky commented Nov 26, 2021

I agree, the quality of the baselines is higher than that of the polygons in Transkribus/P2PaLA results, and yes, this applies to many other datasets, too. But still, we need to find a way around this ex-post in the data.

I am thinking of looking at different implementations for polygonalization and then writing a dedicated tool for that (working in tandem with good binarization). W.r.t. region level we can either use the approach of ocrd_cis.ocropy.lines2regions, or use parts of ocrd_segment.repair to postprocess the annotated segments.
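
As a point of reference for what polygonalization has to improve on, the simplest conceivable baseline-to-polygon scheme is a fixed offset above and below the baseline. The sketch below is only that naive scheme, not any of the implementations mentioned above and not the `ocrd_cis` or `ocrd_segment` code: it ignores the binarized image entirely, and the function name `baseline_to_polygon` and the `ascent`/`descent` parameters are hypothetical. It assumes image coordinates with y growing downwards.

```python
def baseline_to_polygon(baseline, ascent, descent):
    """Build a closed outline around a baseline by shifting it up by
    `ascent` pixels and down by `descent` pixels (y grows downwards)."""
    upper = [(x, y - ascent) for x, y in baseline]             # left to right
    lower = [(x, y + descent) for x, y in reversed(baseline)]  # right to left
    return upper + lower


baseline = [(10, 100), (200, 104)]
print(baseline_to_polygon(baseline, ascent=30, descent=10))
```

A fixed offset will still cut off tall ascenders, long descenders and diacritics, which is why a dedicated tool would need to consult the binarized foreground rather than rely on constants.
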
