Editing searchable .pdf OCR

2014-07
  • Gruber

    My case is quite specific, so I'll try to explain it quickly and precisely. I have to digitize several old paper sheets of 230mm x 268mm (~ 9" x 27,7") folded in 4 parts; you can find a quickly drawn example here to get an idea.

    The scanning and recomposition are not the real issue: I'll scan every fold and put the parts together in Photoshop. What I need is a .pdf file with the original scanned page image, and with the text readable/searchable and indexable by web search engines.
    As you can see in the drawing linked above, the page also has a few ad boxes, which I don't really need OCR'd and which can be left out.

    So far I've used Acrobat Pro X on the .pdf file I recomposed in Photoshop. The results are quite good, but of course not perfect, and what I find most problematic is correcting wrongly recognized text and deleting or excluding unnecessary areas of the document.

    What I'd like to know is whether there is an application for editing the underlying OCR text in a more practical way than Acrobat offers. Acrobat has a "Find suspects" tool in its tool panel (which can be really annoying to use), but the suspects it flags don't cover everything that is actually wrong: many characters it treats as correct are not (e.g. an italic "l" is read as "/", and similar). Unfortunately my text also partly uses other scripts, such as Japanese or Chinese, and that text mostly comes out as gibberish, so I also need to correct the selectable text so that it matches what is printed.

    A side-by-side editor, with the scanned image in one pane and the OCR text of a selected area of the document in the other, would be the ideal solution, I think, for correcting the errors quickly and efficiently.
    The ability to define areas of the scanned document and exclude them from OCR processing would be another much-needed function. I've found that in Acrobat you can use the direct-selection arrow tool to remove text frames, which sort of works, though it is quite hard to use because most of the time you end up clicking on the background scanned image.
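
    To make the "exclude areas" idea concrete, here is a minimal sketch of filtering OCR output by zone. It assumes the OCR tool can export each recognized word together with its bounding box in page pixels; the `Box` class, the `drop_words_in_zones` helper and the sample coordinates are all hypothetical and not taken from Acrobat or any other product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixels (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return not (
            self.right <= other.left
            or other.right <= self.left
            or self.bottom <= other.top
            or other.bottom <= self.top
        )


def drop_words_in_zones(words, zones):
    """Keep only (text, box) pairs whose box touches none of the excluded zones."""
    return [
        (text, box)
        for text, box in words
        if not any(box.overlaps(zone) for zone in zones)
    ]


# Illustrative data only: two words from the article body, one inside an ad box.
words = [
    ("Editing", Box(40, 50, 120, 70)),
    ("OCR", Box(130, 50, 170, 70)),
    ("SALE", Box(420, 610, 500, 640)),
]
ad_zones = [Box(400, 600, 700, 900)]

kept = drop_words_in_zones(words, ad_zones)
print([text for text, _ in kept])  # ['Editing', 'OCR']
```

    The same filter could run before the text layer is written back, so the ad boxes stay visible in the page image but contribute no searchable text.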

    Any suggestions for this type of work? Maybe another, more practical and/or efficient workflow? Any tips are welcome!

    I'm on a Win 7 64-bit machine.

  • Answers
  • user291737

    You might try ABBYY FineReader. It fits the description of your needs.


  • Related Question

    language - scan A4 doc > pdf > ocr > translate to english?
  • adolf garlic

    I've tried using a combination of

    • my home scanner to create a '300 dpi', 'document', 'pdf' (options on Canon all-in-one)
    • ZoHoViewer to create either an RTF or TXT file
    • Google Docs to translate

    I'm not sure how good or bad a product ZoHoViewer is, but the following:

    Als Arbeitsmarkbehörde haben wir den gesetzlichen Auftrag, die Vermittelbarkeit von

    turns into:

    AlsArbeitsmarktbeh6rde habenwirdengesetzlichenAuftrag,dieVermittelbarkeit vonSt...

    Consequently, Google Docs makes a pig's breakfast of trying to translate it.

    Does anyone have any better suggestions (preferably free online services)?


  • Related Answers
  • ChrisF

    Given that the OCR has converted:

    Als Arbeitsmarkbehörde ...

    to:

    AlsArbeitsmarktbeh6rde ...

    A couple of things spring to mind.

    1. Try scanning at a higher dpi. It looks like the OCR can't recognise the spaces between the words; a higher dpi might improve that.

    2. Can you set the language of your OCR program? I see that it's converted the "ö" to a "6". While this might be caused by the resolution, it might also be that, as "ö" isn't an everyday part of English, the program is choosing the "next best" fit, in this case "6".
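
    Whatever the cause, substitutions like this can be flagged after the fact. The following is a rough sketch, not a feature of any particular OCR product: it assumes that a digit or slash sandwiched between two letters is suspicious in running prose. The `SUSPECT` pattern and the `suspect_tokens` helper are hypothetical names, and legitimate tokens such as "and/or" will be flagged too, so the result is a list to review by hand rather than one to correct automatically.

```python
import re

# A digit or slash between two letters is rarely legitimate in running text,
# so it is a cheap signal of a misread glyph (e.g. "ö" read as "6").
# [^\W\d_] matches any Unicode letter.
SUSPECT = re.compile(r"[^\W\d_][\d/][^\W\d_]")


def suspect_tokens(text):
    """Return the whitespace-separated tokens that contain a suspicious character."""
    return [token for token in text.split() if SUSPECT.search(token)]


sample = "AlsArbeitsmarktbeh6rde habenwirdengesetzlichenAuftrag"
print(suspect_tokens(sample))  # ['AlsArbeitsmarktbeh6rde']
```

    Tokens such as "A4" or "300" are not flagged, because the digit is not enclosed by letters on both sides.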

  • 8088

    There have been several other questions about OCR on Super User, which might be worth checking out for possible solutions.

    Most notably this answer by Molly looks promising:

    I really like TopOCR, certainly a great addition to your scan tools:

    • Incredible OCR accuracy, up to 99.8% with a 3 MP camera
    • No page limits, and no extra downloads or components needed
    • Handles images with mixed text and graphics (Manual or Auto Zoning)
    • Tolerates skew and uneven lighting
    • Multiple text output formats, including searchable PDF and HTML
    • Able to read 11 different languages
    • Powerful, easy to use Image Processing with Image Dewarping
    • Supports Smartphones: See some Smartphone samples
    • Includes built-in, full featured Text and Image WYSIWYG Editors
    • Post-processing spell checker for all 11 languages
    • Built-in Text-To-Speech software. How about OCR to MP3?
    • Includes a built-in multi-lingual text translator
    • Supports a Command Line Interface and a GUI
    • Make a high performance document Search and Indexing system
    • Browser Helper Mode supports creating free audio eBooks
    • With TopOCR's Web Engine it's easy to add new features

    It's very accurate and works excellently with low-quality images such as photographs of pages/documents.

    TopOCR is freeware (can be made portable with Universal Extractor)

    Further reading:

    Which OCR software has the most options?

    Practical OCR solution for converting a large book to a digital format?

    How to extract text with OCR from a PDF on Linux?

  • adolf garlic

    Not 100% perfect but the best out of all the things I have tried:

    http://www.paperfile.net/ combined with a language pack (free to download; instructions are in the app). Copy and paste the whole text into a Google Doc, then use Tools > Translate in Google Docs.