Slightly better newline handling. #10

Open
wants to merge 2 commits into
base: master


makindotcc

Hi! I think a new line should be pushed only if the Y translation is lower than 0. Otherwise it breaks a line in the middle of a word in some documents.
Example pdf: https://www.orimi.com/pdf-test.pdf (archived on github for future readers: pdf-test.pdf)

Without my patch it returns a string like this:

    PDF Test File 
     
    Congratulations, your comput
    er is equipped with a PDF (Portable Document Format) reader!  You should be able to view any of
     the PDF documents and forms available on our site.  PDF forms are indicated by these icons: 
      or  
    .   
     
    Yukon Department of Education 
    Box 2703 
    Whitehorse,Yukon 
    Canada 
    Y1A 2C6 
     
    Please visit our website at:  
    http://www.education.gov.yk.ca/

Focus on the second line: Congratulations, your comput[New Line]er. It shouldn't be like this.
With my patch:

    PDF Test File 
     
    Congratulations, your computer is equipped with a PDF (Portable Document Format) 
    reader!  You should be able to view any of the PDF documents and forms available on 
    our site.  PDF forms are indicated by these icons: 
      or  
    .   
     
    Yukon Department of Education 
    Box 2703 
    Whitehorse,Yukon 
    Canada 
    Y1A 2C6 
     
    Please visit our website at:  
    http://www.education.gov.yk.ca/

It doesn't break the line in the middle of a word!
I'm not completely sure if this is 100% right. I don't know why f32::EPSILON was used here, so correct me if I'm wrong, but I tested it with sample documents I found on Google, and the output seems less broken with my patch.
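
For readers comparing the two behaviours, here is a minimal sketch. It assumes the old check was roughly "absolute Y translation greater than f32::EPSILON" and the patched check is "Y translation below zero". The function names and the sample translation values are hypothetical and are not taken from pdf_tools.

```rust
/// Assumed old behaviour: any noticeable vertical move, up or down,
/// starts a new line.
fn breaks_line_epsilon(translation_y: f32) -> bool {
    translation_y.abs() > f32::EPSILON
}

/// Patched behaviour as described above: only a downward move
/// (negative Y translation) starts a new line.
fn breaks_line_downward(translation_y: f32) -> bool {
    translation_y < 0.0
}

/// Joins text chunks, inserting '\n' before a chunk whenever the given
/// predicate says its Y translation starts a new line.
/// Each chunk is (Y translation relative to the previous chunk, text).
fn join_chunks(chunks: &[(f32, &str)], breaks_line: fn(f32) -> bool) -> String {
    let mut out = String::new();
    for &(dy, text) in chunks {
        if breaks_line(dy) && !out.is_empty() {
            out.push('\n');
        }
        out.push_str(text);
    }
    out
}
```

With hypothetical chunks such as `(0.0, "comput")`, `(0.5, "er is equipped")`, `(-14.0, "reader!")`, the epsilon check splits "comput" from "er", while the downward-only check keeps them on one line and still breaks before "reader!".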

@s3bk
Contributor

s3bk commented Nov 15, 2022

This is tricky.
Previously any significant y change would produce a new line; now any downward change will.

A better hack would be to compare it against the height of the line and treat any deviations by less than half the line height as the same line.
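
A rough sketch of that half-line-height idea, assuming each text chunk comes with an absolute baseline y and that a line height is already known. `same_line` and `split_lines` are hypothetical names, not part of pdf_tools or pdf_render.

```rust
/// A chunk stays on the current line when its baseline deviates from the
/// line's baseline by less than half the line height.
fn same_line(line_y: f32, y: f32, line_height: f32) -> bool {
    (y - line_y).abs() < 0.5 * line_height
}

/// Splits (baseline y, text) chunks into lines, in input order.
/// The baseline of the first chunk on a line is used as that line's baseline.
fn split_lines(chunks: &[(f32, &str)], line_height: f32) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut line_y: Option<f32> = None;
    for &(y, text) in chunks {
        let continues = matches!(line_y, Some(base) if same_line(base, y, line_height));
        if continues {
            if let Some(current) = lines.last_mut() {
                current.push_str(text);
            }
        } else {
            lines.push(text.to_string());
            line_y = Some(y);
        }
    }
    lines
}
```

Unlike a downward-only check, this tolerates small shifts in either direction (for example superscripts), but it still trusts the order in which the chunks arrive.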

But ultimately you can't rely on the correct ordering of the text and a much more complicated text clustering approach is needed.
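
To illustrate why ordering matters, here is a deliberately simple sketch that ignores the input order: it groups spans by baseline within a tolerance, then sorts each group left to right. This is plain one-dimensional grouping under assumed inputs, far simpler than the clustering approach referred to above, and it would still mix up multi-column layouts. `Span` and `cluster_lines` are hypothetical names.

```rust
/// A positioned piece of text (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    x: f32,
    y: f32,
    text: String,
}

/// Groups spans into lines by baseline, independent of input order.
fn cluster_lines(mut spans: Vec<Span>, tolerance: f32) -> Vec<String> {
    // PDF y grows upward, so sort by descending y to go top to bottom.
    spans.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    for span in spans {
        // Compare against the first (topmost) span of the current group.
        let fits = lines
            .last()
            .map_or(false, |line| (line[0].y - span.y).abs() <= tolerance);
        if fits {
            if let Some(line) = lines.last_mut() {
                line.push(span);
            }
        } else {
            lines.push(vec![span]);
        }
    }

    lines
        .into_iter()
        .map(|mut line| {
            // Within a line, read left to right.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter().map(|s| s.text.as_str()).collect::<String>()
        })
        .collect()
}
```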

@makindotcc
Author

makindotcc commented Nov 15, 2022

> A better hack would be to compare it against the height of the line and treat any deviations by less than half the line height as the same line.

I thought about that too. So should I implement this? How do I get the line height? I'm not really familiar with the PDF format. I guess I'd have to work with text_matrix and font_size?

> But ultimately you can't rely on the correct ordering of the text and a much more complicated text clustering approach is needed.

Yeah, it doesn't work well with, e.g., multiple columns, but the current state is still better than nothing! For some people it already meets their requirements (me, for example 😁 so thank you and all other contributors :))

@makindotcc changed the title from "Don't add newline if translation Y is >= 0" to "Slightly better newline handling." on Nov 15, 2022
@s3bk
Contributor

s3bk commented Nov 15, 2022

The correct transformations are in pdf_render. Thankfully I forgot most of them...

@makindotcc
Author

According to next_line in pdf_render's textstate and the "TextNewline" operation handler in pdf_tools, the line height is just text_leading, if I understood it correctly? Some PDFs behave the same way: a new line in the text == translating y by -text_leading. I tried to use an Adobe Acrobat trial to create a test document, but pdf_tools::page_text gives me an empty string. I tried printing out the operations:

    operation: Save
    operation: Restore
    operation: BeginMarkedContent { tag: Name("ADBE_FillSign"), properties: None }
    operation: Save
    operation: XObject { name: Name("Fm0") }
    operation: Restore
    operation: EndMarkedContent

and there is nothing related to my sample text, so I gave up and canceled the Acrobat trial subscription.
dummy.pdf
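
On the text_leading point: in the PDF format the `T*` operator has the same effect as `0 -Tl Td`, where `Tl` is the current leading, so a plain newline moves the line origin down by exactly the leading. A minimal sketch of that relationship follows. The `TextState` struct here is a hypothetical stand-in, not the pdf_render type, and it ignores any scaling or rotation in the text matrix.

```rust
/// Minimal stand-in for a text state; only the line origin is tracked.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TextState {
    /// Current leading, as set by the `TL` operator.
    leading: f32,
    line_x: f32,
    line_y: f32,
}

impl TextState {
    /// `Td`: move the start of the line by (tx, ty).
    fn translate(&mut self, tx: f32, ty: f32) {
        self.line_x += tx;
        self.line_y += ty;
    }

    /// `T*`: move to the start of the next line, i.e. `0 -leading Td`.
    fn next_line(&mut self) {
        self.translate(0.0, -self.leading);
    }
}
```

Under this model, a Y translation whose magnitude is close to the leading is a good hint for a real line break, which is one possible source for the line height asked about earlier.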

@s3bk
Contributor

s3bk commented Nov 17, 2022

I guess it does not follow XObjects.
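
As a toy illustration of what following XObjects would mean: when the page content only invokes a form such as `Fm0`, the text lives in that form's own content stream, so extraction has to recurse into it. The `Op` enum and `collect_text` below are hypothetical and do not reflect the real pdf or pdf_tools types.

```rust
use std::collections::HashMap;

/// Toy operation model (hypothetical, not the pdf crate's `Op`).
enum Op {
    Text(String),
    XObject(String),
    Other,
}

/// Collects text from `ops`, descending into form XObjects by name.
/// `depth` bounds the recursion so cyclic references cannot loop forever.
fn collect_text(ops: &[Op], forms: &HashMap<String, Vec<Op>>, depth: usize, out: &mut String) {
    if depth == 0 {
        return;
    }
    for op in ops {
        match op {
            Op::Text(text) => out.push_str(text),
            Op::XObject(name) => {
                // Unknown names (e.g. image XObjects) are skipped.
                if let Some(inner) = forms.get(name) {
                    collect_text(inner, forms, depth - 1, out);
                }
            }
            Op::Other => {}
        }
    }
}
```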
