Training Tesseract-OCR for English language fonts
2014-07
I have about 3000 small images of single words that I am trying to convert to text. I installed Tesseract on my Windows 7 machine using the installer and have successfully run OCR on images through cmd and PowerShell.
tesseract.exe imagename.png imagename
produces a text file with the converted text.
The results were terrible: only about 40% of characters were converted correctly. I would like to improve on that.
Does anyone know which optional configuration settings can be passed to this command? The usage message is:
tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
Also, could someone describe the training procedure? I am finding the documentation hard to understand. I know that my text is in Times New Roman. Do I need to train Tesseract for Times New Roman, is support for it already built in, or is it possible to download files that allow Tesseract to recognize it?
One way to improve the results is to preprocess the images, for example by removing any skew and thresholding them. You can use OpenCV for this. If accuracy is still poor afterwards, you can train Tesseract on your font.
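To make the thresholding step concrete, here is a minimal sketch of Otsu's method in plain Python, which picks a cut-off between dark text and light background from the image histogram. It assumes the image is already loaded as rows of 8-bit grey values; the function names are made up for illustration, and a library such as OpenCV would normally do this for you. Deskewing is not covered.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue              # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break                 # everything is background from here on
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if p > threshold else 0 for p in row] for row in rows]


# Tiny made-up 2x3 "image": three dark pixels, three light ones.
image = [[10, 12, 200],
         [11, 210, 205]]
t = otsu_threshold([p for row in image for p in row])
clean = binarize(image, t)
```

The binarized image is what you would save and pass to Tesseract in place of the original.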
I just installed Tesseract OCR and was able to convert a TIF image to its corresponding text. The application seems fairly easy to use, but I am struggling to find documentation that will help me make the most of it. So, here are a couple of questions I hope someone here can help me with:
I'm converting PDFs to TIFs using ImageMagick. What settings do I need to use when I make this conversion? Basically, what image settings would be optimal for OCR?
How do I use hOCR?
Thanks.
Honestly, I've never been able to find a command for ImageMagick that produces a TIF in the format that Tesseract prefers. My own solution was instead to have ImageMagick convert each PDF page to a PNG, build Tesseract with Leptonica and then run Tesseract against those images.
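As a rough sketch of that PNG workflow, the helpers below only build the two command lines; they do not run anything. The 300 DPI density and the page-%03d.png naming pattern are my assumptions rather than settings from this thread, and the helper names are invented. ImageMagick generally needs Ghostscript installed to read PDFs, and on Windows the name "convert" can clash with a system tool, so you may need the full path to ImageMagick's executable.

```python
import os


def pdf_to_png_command(pdf_path, out_pattern="page-%03d.png", dpi=300):
    """Build an ImageMagick command that renders each PDF page to a PNG.

    -density is placed before the input file so that it applies when the
    PDF is rasterized. The 300 DPI default is an assumption, not a value
    taken from the Tesseract documentation.
    """
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def tesseract_command(image_path):
    """Build a Tesseract command that writes its output next to the image."""
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base]


convert_cmd = pdf_to_png_command("scan.pdf")
ocr_cmd = tesseract_command("page-000.png")
```

Each list can be handed to subprocess.run, or joined with spaces and pasted into a shell.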
And to use hOCR, make a new text file with the following text in it: "tessedit_create_hocr 1". Save it somewhere and when you execute tesseract use:
tesseract [inputFile] [outputFile] [-l optionalLanguageFile] [PathTohOCRConfigFile]
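If you want to script this, a minimal sketch follows. It writes the one-line config file and builds the matching command line; the file name hocr_config.txt and both function names are placeholders I made up, and the argument order simply follows the usage line above.

```python
import os
import tempfile


def write_hocr_config(directory=None, name="hocr_config.txt"):
    """Create a config file that switches Tesseract to hOCR output."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("tessedit_create_hocr 1\n")
    return path


def hocr_command(image, output_base, config_path, lang=None):
    """Build the Tesseract command; the config file goes last."""
    cmd = ["tesseract", image, output_base]
    if lang:
        cmd += ["-l", lang]
    cmd.append(config_path)
    return cmd


config = write_hocr_config()
cmd = hocr_command("page-000.png", "page-000", config, lang="eng")
```

Many Tesseract installs also ship a ready-made config called hocr in tessdata/configs, in which case passing hocr as the last argument should have the same effect; check whether your install includes it.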