lots of invalid files #5
Comments
Note that most PAGE files were produced with Aletheia, which comes from the inventors of the PAGE format. If there is anything wrong with such PAGE files, you should report it there. Some PAGE files were produced by Transkribus. It is a known problem that such PAGE files are "special". See for example Transkribus/TranskribusCore#45, now moved to GitLab.
Yes, the 2019 files are all good, only the Transkribus ones are invalid. But it's no use looking back at the tool that generated them – this concerns the dataset alone. The first error is trivial to fix, and the second is a question of devising a good mapping scheme from the various descriptions in
Again, that's a known problem of Transkribus. Obviously Transkribus users don't care about correct boxes / polygons. Usually the baselines are better, but it looks like Transkribus does not fix the polygon dimensions when baselines are modified. The same problems exist for all PAGE files of AustrianNewspapers (also produced by Transkribus). And the bad news is that probably most transcriptions today are still done by people using Transkribus. Nevertheless I see no progress or improvement on that side.
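To get a rough idea of how many lines are affected, one could flag every text line whose baseline leaves the bounding box of its own polygon. This is only a sketch: the function names and the sample document are made up for illustration, it uses the bounding box rather than a true point-in-polygon test, and it assumes every `Coords` and `Baseline` carries a `points` attribute with integer pairs.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the namespace from a tag so 2013 and 2019 files are handled alike."""
    return tag.rsplit("}", 1)[-1]

def parse_points(points):
    """Turn a points string like '10,20 30,40' into [(10, 20), (30, 40)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]

def baselines_outside_coords(page_xml):
    """Return ids of TextLine elements whose Baseline leaves the bbox of their Coords."""
    root = ET.fromstring(page_xml)
    suspicious = []
    for line in root.iter():
        if local(line.tag) != "TextLine":
            continue
        children = {local(child.tag): child for child in line}
        coords, baseline = children.get("Coords"), children.get("Baseline")
        if coords is None or baseline is None:
            continue
        xs, ys = zip(*parse_points(coords.get("points")))
        inside = all(min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)
                     for x, y in parse_points(baseline.get("points")))
        if not inside:
            suspicious.append(line.get("id"))
    return suspicious

# Hypothetical sample: line l2 has a baseline that was moved without its polygon.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="x.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,100 0,100"/>
      <TextLine id="l1">
        <Coords points="10,10 90,10 90,30 10,30"/>
        <Baseline points="10,25 90,25"/>
      </TextLine>
      <TextLine id="l2">
        <Coords points="10,40 90,40 90,60 10,60"/>
        <Baseline points="10,70 95,70"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(baselines_outside_coords(SAMPLE))
```

A bounding-box test will miss baselines that stay inside the box but outside a non-rectangular polygon, so it gives a lower bound on the number of affected lines.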
I agree, the quality of the baselines is higher than that of the polygons in Transkribus/P2PaLA results, and yes, this applies to many other datasets, too. But still, we need to find a way around this ex post in the data. I am thinking of looking at different implementations for polygonalization and then writing a dedicated tool for that (working in tandem with good binarization). W.r.t. the region level, we can either use the approach of ocrd_cis.ocropy.lines2regions, or use parts of ocrd_segment.repair to postprocess the annotated segments.
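As a baseline to compare real polygonalization implementations against, here is the most naive possible variant: extrude the baseline by a fixed ascent and descent and clip to the image. This is not what ocrd_cis or ocrd_segment do, and it ignores the binarized image entirely. The function names and the `ascent` / `descent` parameters are hypothetical.

```python
def polygon_from_baseline(baseline, ascent, descent, width, height):
    """Build a closed outline around a baseline given as [(x, y), ...] in reading order.

    ascent/descent are fixed pixel offsets above and below the baseline
    (an assumption; a real tool would estimate them from the image).
    """
    def clip(x, y):
        # Keep every point inside the image so no negative coordinates arise.
        return (min(max(x, 0), width - 1), min(max(y, 0), height - 1))

    upper = [clip(x, y - ascent) for x, y in baseline]            # left to right
    lower = [clip(x, y + descent) for x, y in reversed(baseline)]  # right to left
    return upper + lower

def to_points(polygon):
    """Serialize [(x, y), ...] into the 'x,y x,y' form used by the points attribute."""
    return " ".join(f"{x},{y}" for x, y in polygon)

polygon = polygon_from_baseline([(10, 50), (90, 52)], ascent=20, descent=5,
                                width=100, height=100)
print(to_points(polygon))
```

The clipping step is the one part worth keeping in any real tool, since it guarantees the output stays within the image bounds.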
Going over the PAGE files with a linter against the actual schema from Transkribus (they hijacked the 2013 namespace), or the upstream 2019 schema (where it applies), yields different sources of error on various files:
- <Software> Transkribus <Software>
- /PcGts/Metadata/Comments
- for recursive elements (despite being a simpleType)
- @points (both x and y)
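For the `@points` item, a quick check that does not need the full schema could look like the following. It is a sketch under assumptions: the regular expression is modeled from memory on the points type in the PAGE schema (at least two pairs of non-negative integers) and should be compared against the schema file actually in use, and the function name and sample are made up.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: two or more "x,y" pairs of non-negative integers, space-separated.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")

def invalid_points(page_xml):
    """Return (element name, points value) for every points attribute that fails the pattern."""
    root = ET.fromstring(page_xml)
    problems = []
    for elem in root.iter():
        points = elem.get("points")
        if points is not None and not POINTS_PATTERN.fullmatch(points):
            problems.append((elem.tag.rsplit("}", 1)[-1], points))
    return problems

# Hypothetical sample with one negative x coordinate.
LINT_SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="x.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="10,10 -5,20 30,40"/>
      <TextLine id="l1">
        <Coords points="10,10 90,10 90,30 10,30"/>
        <Baseline points="10,25 90,25"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for name, value in invalid_points(LINT_SAMPLE):
    print(name, value)
```

This only covers the attribute syntax. Structural problems such as an unexpected element under /PcGts/Metadata still need a real schema validator.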