Training Tesseract-OCR for English language fonts

07
2014-07
  • andrew

    I have about 3000 small images of single words that I am trying to convert to text. I have installed Tesseract on my Windows 7 machine using the installer and successfully managed to OCR images through cmd and PowerShell.

     tesseract.exe imagename.png imagename 
    

    produces a text file with the converted text.
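    For a few thousand images, the same command could be looped from a script. A minimal Python sketch, assuming `tesseract` is on the PATH and the images are PNG files in one folder; `build_command` and `ocr_folder` are illustrative names, not part of Tesseract:

```python
from pathlib import Path
import subprocess


def build_command(image_path, lang=None, extra_args=()):
    """Return the argv list for one tesseract run.

    The output base is the image path without its extension, so
    word01.png produces word01.txt next to it.
    """
    image = Path(image_path)
    cmd = ["tesseract", str(image), str(image.with_suffix(""))]
    if lang:
        cmd += ["-l", lang]
    cmd += list(extra_args)
    return cmd


def ocr_folder(folder, pattern="*.png", lang=None, extra_args=()):
    """Run tesseract on every matching image and return the paths processed."""
    done = []
    for image in sorted(Path(folder).glob(pattern)):
        # check=True raises CalledProcessError if tesseract exits with an error
        subprocess.run(build_command(image, lang, extra_args), check=True)
        done.append(image)
    return done
```

    A call might look like `ocr_folder(r"C:\words", lang="eng", extra_args=["-psm", "8"])`, where the folder is hypothetical. In Tesseract 3.x, `-psm 8` treats the image as a single word, which may suit single-word images; later versions spell the option `--psm`.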

    The results were terrible, with only about 40% of characters converted correctly. I would like to improve the results.

    Does anyone know what optional configurations can be given in this command? The usage message is:

    tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
    

    Also, could someone describe the training procedure? I am finding the documentation hard to understand. I know that my text is in Times New Roman. Do I need to train Tesseract for TNR, or is that already built in, and/or is it possible to download files that allow Tesseract to recognize it?

  • Answers
  • Pranaysharma

    One way to improve the results is to preprocess the images, for example by removing any skew and thresholding them. You can use OpenCV. Later you can train Tesseract on your text.
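    To make the thresholding step concrete, here is a small sketch of Otsu's method in plain Python. It is a stand-in to show the idea, not the answer's exact approach; with OpenCV the equivalent would normally be `cv2.threshold` with the `cv2.THRESH_OTSU` flag. It assumes the image is already a flat list of grey levels from 0 to 255, and the function names are illustrative. Deskewing is not covered.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate level
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the lower class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is in the lower class; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (at or below the threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

    The binarized pixels would then be written back to an image file before running Tesseract on it.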


  • Related Question

    imagemagick - Tesseract OCR - Newbie Questions!
  • Questioner

    I just installed Tesseract OCR and was able to convert a TIF image to its corresponding text. The application seems fairly easy to use, but I am struggling to find the documentation that will help me make the most of it. So, here are a couple of questions I hope someone here can help me with:

    1. I'm converting PDFs to TIFs using ImageMagick. What settings do I need to use when I make this conversion? Basically, what image settings would be optimal for OCR?

    2. How do I use hOCR?

    Thanks.


  • Related Answers
  • 8088

    Honestly, I've never been able to find a command for ImageMagick that produces a TIF in the format that Tesseract prefers. My own solution was instead to have ImageMagick convert each PDF page to a PNG, build Tesseract with Leptonica and then run Tesseract against those images.
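    The PDF-to-PNG route described above could be scripted roughly as follows. This is a sketch under assumptions: ImageMagick's `convert` is on the PATH (on Windows the name can clash with the system `convert.exe`, and ImageMagick 7 uses `magick` instead), 300 DPI is a guess at a reasonable OCR resolution rather than a setting from the answer, and the function names and output pattern are illustrative.

```python
import os


def pdf_to_png_command(pdf_path, out_pattern="page-%03d.png", density=300):
    """argv for ImageMagick: rasterize each PDF page to a numbered PNG.

    -density is placed before the input file so that it controls the
    resolution at which the PDF is rendered.
    """
    return ["convert", "-density", str(density), pdf_path, out_pattern]


def tesseract_command(png_path, lang=None):
    """argv for Tesseract: OCR one PNG into <same name>.txt."""
    output_base = os.path.splitext(png_path)[0]
    cmd = ["tesseract", png_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd
```

    Each list can be passed to `subprocess.run(..., check=True)`, first the conversion and then one Tesseract call per generated page.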

    And to use hOCR, make a new text file with the following text in it: "tessedit_create_hocr 1". Save it somewhere and when you execute tesseract use:

    tesseract [inputFile] [outputFile] [-l optionalLanguageFile] [PathTohOCRConfigFile]
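
    The two hOCR steps above (create the config file, then pass it as the last argument) might be automated like this. It is a sketch, not the only way to do it; the file name `hocr.cfg` and the function names are hypothetical.

```python
from pathlib import Path


def write_hocr_config(path="hocr.cfg"):
    """Create the one-line config file that switches on hOCR output."""
    Path(path).write_text("tessedit_create_hocr 1\n")
    return str(path)


def hocr_command(input_file, output_base, config_path, lang=None):
    """argv for Tesseract with the hOCR config file as the final argument."""
    cmd = ["tesseract", input_file, output_base]
    if lang:
        cmd += ["-l", lang]
    cmd.append(config_path)
    return cmd
```

    For example, `hocr_command("scan.png", "scan", write_hocr_config(), lang="eng")` yields a list that can be handed to `subprocess.run`.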