Add parallel processing to OCR text extraction of full documents #124

Open
wants to merge 2 commits into
base: master
from


@ntodd ntodd commented Dec 18, 2014

Leverage GNU Parallel to OCR multiple pages concurrently. If Parallel is installed, a full-document extraction generates an image for each page and then runs tesseract on those images in parallel, one process per available core. If Parallel is not installed, or only a subset of pages is requested, the old behavior is used. This speeds up OCR processing significantly on multi-core machines.
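For readers who want the shape of the idea without reading the diff, here is a minimal Python sketch of the dispatch described above. It is an illustration only, not the code in this pull request: the function names `build_ocr_commands` and `ocr_pages` and the `page_N.tif` file names are hypothetical, and it assumes the page images have already been generated. It relies on GNU Parallel's `{}` and `{.}` replacement strings (input path, and input path without its extension) and on Parallel's default of one job per CPU core.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ocr_commands(image_paths: List[str],
                       parallel_path: Optional[str]) -> List[List[str]]:
    """Return the command lines needed to OCR the given page images.

    Hypothetical helper: with a path to GNU Parallel, one command fans
    the pages out across cores; without it, one tesseract command per page.
    """
    if parallel_path:
        # {} is the image path, {.} is the same path minus its extension,
        # which tesseract uses as the output base name.
        return [[parallel_path, "tesseract", "{}", "{.}", ":::", *image_paths]]
    # Fallback: sequential tesseract calls, one per page image.
    return [["tesseract", p, str(Path(p).with_suffix(""))] for p in image_paths]


def ocr_pages(image_paths: List[str], pages: Optional[List[int]] = None) -> None:
    """Run OCR, using Parallel only for full-document extraction."""
    # A page subset (pages is not None) always takes the sequential path.
    parallel_path = shutil.which("parallel") if pages is None else None
    for cmd in build_ocr_commands(image_paths, parallel_path):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the fallback easy to check without tesseract or Parallel installed.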

With a bit more work, this could be leveraged by the other OCR code paths.

Nate Todd added 2 commits December 18, 2014 17:20
Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction.  If Parallel is not installed, use previous behavior.
@deuxshaish

I like this a lot. Will test and observe. Thanks for the commit.

@pickhardt

This is a great idea.

3 participants