Add option to generate hOCR output instead of raw text when performing OCR via tesseract #81

Closed
wants to merge 2 commits into from

jhosteny
This patch forces tesseract to generate hOCR output when the --hocr option is passed. It also suppresses text cleaning in that mode. This addresses issue #80.
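
As a rough illustration of what the option changes, the sketch below builds a tesseract command line with and without hOCR output. It is written in Python for brevity and is not the code in this patch; `build_tesseract_args`, `should_clean_text`, and their parameters are hypothetical names. The only assumption about tesseract itself is its usual `tesseract imagename outputbase [-l lang] [configfile...]` invocation form.

```python
def build_tesseract_args(image_path, output_base, language=None, hocr=False):
    """Return an argv list for tesseract (hypothetical helper, not the patch itself)."""
    args = ["tesseract", image_path, output_base]
    if language:
        args += ["-l", language]
    if hocr:
        # Trailing arguments are config file names; "hocr" selects hOCR output.
        args.append("hocr")
    return args


def should_clean_text(hocr=False):
    """Decide whether to run a plain-text cleaning pass on the OCR result."""
    # hOCR is markup, so a cleaner written for raw text is skipped for it.
    return not hocr
```

Keeping the cleaning decision next to the argument builder makes it harder for the two to drift apart when the flag is toggled.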

@knowtheory
Member

Hey @jhosteny, have you tested this patch? As far as I'm aware, you have to pass in a config file, which this pull request doesn't supply.

@jhosteny
Author

@knowtheory, sorry for the late reply. Yes, I am using my fork with this change in a project, and no additional configuration is necessary. I'm running the latest tesseract on Ubuntu Raring. Here are the details:

tesseract 3.02.01
 leptonica-1.69
  libgif 4.1.6 : libjpeg 8b : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.7

I may have missed something, but it didn't look like there was a test that runs tesseract. If you'd rather wait until one is there, I can work on that as part of a new patch.

@jsfenfen

@knowtheory: This works for me while running "Tesseract Open Source OCR Engine v3.02.02" on Ubuntu 12.04, w/ leptonica 1.69. I think that the argument (i.e. "hocr") is actually the name of the config file to use, and I'm guessing it only works if a config file of that name is in the right place (maybe /somewhere/tessdata/configs/). The documentation isn't especially clear. The hocr config file used is defined here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr and the whole set of default configs is available here: http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata/configs

For the sake of argument, would it make sense for the patch to just give the option of specifying a path to a config file? That way a more complex config file could be used, and it wouldn't be explicitly dependent on the tesseract library shipping with the default configs.
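
A minimal sketch of that alternative, in Python for illustration only: the caller may pass an explicit config file path, and the bare "hocr" config name is used only as a fallback. `config_arguments` and its parameters are hypothetical, and the sketch assumes tesseract accepts a filesystem path in the config-file position, which should be checked against the installed version.

```python
import os


def config_arguments(hocr=False, config_path=None):
    """Pick the trailing config arguments for tesseract (hypothetical helper)."""
    if config_path is not None:
        # Fail early with a clear error instead of letting the OCR run
        # proceed without the intended config.
        if not os.path.isfile(config_path):
            raise FileNotFoundError(config_path)
        return [config_path]
    if hocr:
        # Relies on the stock "hocr" config shipping with the tesseract install.
        return ["hocr"]
    return []
```

With this shape, an explicit path takes precedence over the built-in name, so the option no longer depends on the default configs being installed.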

@jhosteny
Author

Closing in favor of #92.

@jhosteny jhosteny closed this Aug 28, 2013
3 participants