Can Acrobat 11 be made to do OCR using multiple CPU cores?

2014-07-08
  • tarcman.

    OCR processing takes time, and using multiple CPU cores would speed it up. Acrobat 10 was not a multithreaded application. How about Acrobat 11? Does it use multiple CPU cores for OCR by default (if available)? If not, are there any workarounds, e.g. scripting, to make Acrobat 11 do OCR on multiple CPU cores, either through Acrobat's built-in scripting language or through external scripts that launch multiple single-threaded instances of Acrobat and direct each one to process part of the job in parallel?

    Note: This question is not too localized (not limited to a specific moment in time) because (1) Adobe does not release new major Acrobat versions very often (Acrobat 10 was released two years ago) and (2) Adobe Acrobat is a widely used application.

  • Answers
  • slhck

    I have installed the Acrobat 11 (XI) trial in VirtualBox. Acrobat 11 is single-threaded.

    I have also written an external script that starts multiple Acrobat instances (one per CPU core), processes the OCR job in parallel and merges the results. A crucial step is to turn on error logging in the Acrobat preferences, parse all the .log files and reprocess any files that reported errors. With 4 cores, the script still does OCR more than twice as fast as Acrobat 11 does by default.
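
    The script itself is not included in this answer. As a rough illustration of the approach it describes (split the job, run the parts in parallel, re-run any part that reported an error), here is a minimal Python sketch. The names split_pages, run_parallel and ocr_chunk are hypothetical. In particular, ocr_chunk stands in for whatever launches one Acrobat instance on a page range and checks its .log file; that part is not shown because the answer gives no details about it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(page_count, workers):
    """Split pages 1..page_count into at most `workers` contiguous (first, last) ranges."""
    if page_count <= 0:
        return []
    workers = max(1, min(workers, page_count))
    base, extra = divmod(page_count, workers)
    ranges, start = [], 1
    for i in range(workers):
        # The first `extra` ranges get one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def run_parallel(ranges, ocr_chunk, workers, max_retries=1):
    """Run ocr_chunk(first, last) on every range in parallel and retry failures.

    ocr_chunk is a hypothetical, caller-supplied function. It is assumed to
    block until one OCR instance has finished its range and to return True on
    success or False if that instance's log reported an error.
    Returns the ranges that still failed after all retries.
    """
    pending = list(ranges)
    for _ in range(max_retries + 1):
        if not pending:
            break
        # Threads are enough here: each one only waits on an external process.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda r: ocr_chunk(*r), pending))
        pending = [r for r, ok in zip(pending, results) if not ok]
    return pending
```

    For a 10-page document on 4 cores, split_pages(10, 4) yields the ranges (1, 3), (4, 6), (7, 8) and (9, 10). Merging the per-range output files back into one PDF is left out, since it depends on the tool used.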

  • Isaac Rabinovitch

    Multithreading needs to be built into an application. The developer has to write code that creates threads and that breaks the task down into subtasks that can be allocated to each thread. If the developers of Acrobat failed to do this for their OCR code, there's no way for the user to add the extra logic needed.


  • Related Question

    pdf - Can I force Acrobat Professional to replace the OCR-ed selectable image with text?
  • rumtscho

    I have a book I want to read onscreen. It is scanned at 200 dpi monochrome (I still don't know what went wrong in the scanner driver; I remember setting it to grayscale, but cannot afford the time to scan again), so it is hard to read. I OCRed it with Acrobat Pro, and it went reasonably well. But the result is either something called "Searchable image" or "ClearScan". I like the fact that the layout is preserved, but the problem is that the text is shown as it was scanned, so it is difficult to read onscreen. Besides, the whole book takes up 70 MB.

    Here you can see what the already recognized text looks like:

    [image: sample of the recognized text]

    I tried other OCR programs, but (besides hogging 100% processor time and memory for 2 min per double page) they all recognized the text, leaving the figures completely out. I don't care that much about the layout and the typography, but the figures are important (I don't need the text labels in the images to be OCRed). And I think that if it were to use ASCII for the text and images for the figures, the size should drop considerably.

    So is there a way to ditch the images of the text and use the OCRed version for reading while keeping the figures in their places? I'd prefer the end result to be a PDF file, but I am open to other formats too. I know I could do it manually by pasting the OCRed text into Word and capturing screenshots of the images, but this is too much work for 520 pages.
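
    The size argument can be made concrete with a rough estimate. The sketch below uses the 70 MB and 520 pages stated in the question; the figure of 3000 characters per page is purely an assumption for illustration, 1 MB is taken as 10^6 bytes, and the space needed for the figures is ignored because the question does not say how many there are.

```python
def estimate_sizes(total_bytes, pages, chars_per_page):
    """Return (scanned bytes per page, plain-text bytes for the whole book)."""
    scan_per_page = total_bytes / pages
    text_total = pages * chars_per_page  # one byte per ASCII character
    return scan_per_page, text_total


# 70 MB and 520 pages come from the question; 3000 chars/page is an assumption.
scan_per_page, text_total = estimate_sizes(70 * 10**6, 520, 3000)
print(f"scanned pages: about {scan_per_page / 1000:.0f} kB each")
print(f"plain text:    about {text_total / 10**6:.2f} MB for the whole book")
```

    Under these assumptions the text of the entire book needs roughly as much space as a dozen scanned pages, so the final size would be dominated by the figure images.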


  • Related Answers
  • Kees

    In Omnipage 16, 17, 18, you can (better layout):

    • select zone types automatic or by hand
    • adjust the selected zone type: text, picture, table
    • rotate pages
    • change double pages into single pages
    • export to pdf with and without original scanned image (clearer, easier to read)

    On demand, the program can also (better recognition):

    • straighten pages
    • straighten lines

    Omnipage 17 and 18 also straighten curved pages and correct wrong angles in digital camera images (close-ups).

    ABBYY 8, 9 and 10 have the same features but give poorer results for digital camera pictures.

    ABBYY 10 has a great "On screen Reader". With it you can recognise parts of the text on your monitor, or even select text from online books such as Google Books or scribd.com. Turn your monitor vertical and make sure the text is at maximum size.

    Infix works for cleaning up a recognised PDF exported as "text with pictures". It is an easy way to erase a wrongly selected part of a page with no picture, etc. It can also add pages to a PDF or erase pages.

    Able2Extract is great for recognising tables. PDF2XL does this too.

    Scan Tailor is free, if a bit unfriendly, and gets just the black text out of a scan. If parts of a page are missing, set the individual page sizes again.

    With ABBYY it is also possible to get just the black text and pictures. The saved work files contain B+W TIFF pages. You can copy these elsewhere, erase the thumbnail files or metadata, and put the TIFFs in a multi-page TIFF or a PDF. This file is bigger than a recognised PDF.

    Photoshop or Paint Shop Pro can help to change the picture of the scanned text, on a single page or in batch mode.

    Paperport (not perfect) helps with scanning: it makes text blacker during scanning and fixes text etc. after scanning, but it works only on individual pages and then puts the single pages into one PDF.

    Bookmaker is expensive and older still, but it fixes some page curves and can erase blackened sides. The trial has limitations, but somewhere hidden there is a way to export to TIFFs, page by page.

    Changing parameters of scanner software can give better output.

    Taking pictures of a 500-page book would take 1 hour.

    • use a tripod
    • ISO 100 or 200
    • set the white balance manually using the white paper of the book (or other paper that is "more white")
    • good light but not direct sunlight
    • watch for big shadows between pages; turn the book halfway when needed
    • do some tests
    • with an SLR, use a higher f-stop such as f/8 or f/11 for better depth of field