Slightly better newline handling. #10
base: master
This is tricky. A better hack would be to compare the vertical offset against the height of the line and treat any deviation of less than half the line height as the same line. But ultimately you can't rely on the text being in the correct order, and a much more complicated text-clustering approach is needed.
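A minimal sketch of the half-line-height heuristic described above, under stated assumptions: `is_new_line` and `join_runs` are hypothetical names, each run is a `(y, text)` pair with a baseline position already in page units, and `line_height` is supplied by the caller. None of this is taken from the crate's actual code.

```rust
/// Decide whether a vertical move between two consecutive text runs
/// starts a new line. `dy` and `line_height` are hypothetical inputs
/// expressed in the same units.
fn is_new_line(dy: f32, line_height: f32) -> bool {
    // Deviations smaller than half the line height (superscripts,
    // baseline jitter) are treated as staying on the same line.
    dy.abs() >= 0.5 * line_height
}

/// Concatenate positioned runs, inserting '\n' wherever the heuristic fires.
/// Each run is (baseline y, text); runs are assumed to arrive in reading order.
fn join_runs(runs: &[(f32, &str)], line_height: f32) -> String {
    let mut out = String::new();
    let mut prev_y: Option<f32> = None;
    for &(y, text) in runs {
        if let Some(py) = prev_y {
            if is_new_line(y - py, line_height) {
                out.push('\n');
            }
        }
        out.push_str(text);
        prev_y = Some(y);
    }
    out
}
```

This still assumes the runs arrive in reading order, so it does not help with multi-column layouts. A `line_height` of zero would make every run start a new line, so a caller would need a fallback value in that case.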
I thought about that too. So should I implement this? How do I get the line height? I'm not really familiar with the PDF format. I guess I'd have to mess with the
Yeah, it doesn't work well with, e.g., multiple columns, but the current state is still better than nothing! For some people it already meets their requirements (me, for example 😁 so thank you and all the other contributors :))
The correct transformations are in pdf_render. Thankfully I forgot most of them...
According to next_line in the pdf_render textstate and the "TextNewline" operation handler in pdf_tools, the line height is just
and there is nothing related to my sample text, so I gave up and canceled the Acrobat trial subscription.
I guess it does not follow XObjects.
Hi! I think a new line should be pushed only if the Y translation is lower than 0. Otherwise it breaks a line in the middle of a word in some documents.
Example pdf: https://www.orimi.com/pdf-test.pdf (archived on github for future readers: pdf-test.pdf)
Without my patch it returns a string like this:
Focus on the second line: "Congratulations, your comput[New Line]er". It shouldn't be like this.
With my patch:
It doesn't break the line in the middle of a word!
I'm not completely sure if this is 100% right. I don't know why f32::EPSILON was here, so correct me if I'm wrong, but I tested it with sample documents I found on Google, and it seems to be less broken with my patch.
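To make the difference concrete, here is a hedged sketch of the two checks side by side. Both function names are hypothetical, and the shape of the earlier check (any vertical move larger than `f32::EPSILON` counts as a break) is only a guess based on the mention of `f32::EPSILON` above, not the crate's actual code. The second function is the rule proposed in this comment.

```rust
/// Assumed shape of the earlier check: any non-zero vertical translation,
/// up or down, counts as a line break. This is a guess, not the crate's code.
fn breaks_on_any_move(ty: f32) -> bool {
    ty.abs() > f32::EPSILON
}

/// The proposed rule: only a downward move breaks the line. In default PDF
/// text space, Y grows upward, so the next line has a negative Y translation.
fn breaks_on_downward_move(ty: f32) -> bool {
    ty < 0.0
}
```

With a small upward nudge such as `ty = 0.3`, the first check reports a break and the second does not, which would match the mid-word split seen in the sample document. A real move to the next line, such as `ty = -12.0`, is reported by both.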