ocr - Document scanning: How to speed up the software part of the scanning process?

2014-07
  • user291737

    I am looking for a solution to speed up my document scanning process, especially for those documents that are not suitable for a typical document scanner with an automatic document feeder (ADF). For those documents I currently use a flatbed scanner.

    At first I thought that faster scanning hardware would be the solution (e.g. a camera scanner instead of a typical flatbed scanner). But I noticed that the scan hardware (movement of the scan head) accounts for only about 20 % of the total time per scan, while the software (image enhancement and optical character recognition) accounts for about 80 %.
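    To put numbers on that 20/80 split, here is a minimal sketch using Amdahl's law. The function name and the "twice as fast" factor are illustrative assumptions; only the 20 % and 80 % shares come from the observation above.

```python
def overall_speedup(fraction_sped_up: float, factor: float) -> float:
    """Amdahl's law: overall speedup when only one part of the job gets faster."""
    remaining = 1.0 - fraction_sped_up
    return 1.0 / (remaining + fraction_sped_up / factor)


# Shares taken from the post: hardware ~20 %, software ~80 % of the total time.
print(round(overall_speedup(0.20, 2.0), 2))  # scan hardware twice as fast -> 1.11
print(round(overall_speedup(0.80, 2.0), 2))  # software twice as fast      -> 1.67
```

    Under these assumptions, doubling the hardware speed saves about 10 % of the total time, while doubling the software speed saves about 40 %.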

    To speed up scanning I looked into the following: (a) Scan software that uses multiple cores/threads of the CPU instead of just one. Despite an extensive search, I have not yet found a multi-threaded program for TWAIN. (b) Workflow + software: a program that lets me define my own scan profiles. But I have not yet found software that offers scan profiles and, at the same time, good auto-cropping (and OCR not only in English). (c) Workflow: moving OCR into a separate step. I did not gain any speed this way, because the software bundled with my CanoScan flatbed scanners takes the same time for a scan whether I include OCR or not.

    How can I speed up my scanning?

    For those who know the third-party document scanning software on the market: Will I see a considerable difference in speed between an i7 dual-core and an i7 quad-core CPU?

    By document scanning software I mean software that includes image enhancement features (e.g. deskew, auto-crop, descreen), OCR (not only for English), the ability to save to a number of file types (JPG, JPEG 2000, TIFF, searchable PDF, PDF/A), and scan profiles (= user-defined combinations of dpi, image enhancement settings, OCR language, and file type).

  • Answers
  • Damon

    First, separate the scanning process from the post-processing. Do this by scanning to image files at 300-600 DPI (or more if you need it). The files will be large, but they are only temporary until you post-process them. File size will be your biggest slowdown here, so drop your resolution and bit depth as low as you comfortably can (e.g. use greyscale if you do not need color). What you do not want are 24-bit, 1200 DPI images of 8-1/2" x 11" pages that are hundreds of MB each, unless you have to; they take too long to save and open.
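    As a rough check on those file sizes, the sketch below computes the uncompressed size of a letter-size page at a few settings. The function name is made up for illustration, and real files will usually be smaller once the format applies compression.

```python
def uncompressed_megabytes(width_in: float, height_in: float,
                           dpi: int, bits_per_pixel: int) -> float:
    """Raw (uncompressed) image size in MB for a page scanned at `dpi`."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000


settings = [
    ("24-bit color, 1200 DPI", 1200, 24),
    ("8-bit greyscale, 600 DPI", 600, 8),
    ("8-bit greyscale, 300 DPI", 300, 8),
]
for label, dpi, bits in settings:
    size = uncompressed_megabytes(8.5, 11, dpi, bits)
    print(f"{label}: {size:.1f} MB")
# 24-bit color, 1200 DPI: 403.9 MB
# 8-bit greyscale, 600 DPI: 33.7 MB
# 8-bit greyscale, 300 DPI: 8.4 MB
```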

    Then, using any software that suits your needs, run your post-processing as a batch at your convenience. Every program works differently, so you will have to learn yours.

    Here is the catch, though. Most programs only run on 1 core of your multi-core CPU, so the best way to make things run faster is to open your program multiple times and split the batches between the open instances. Most programs will not open a second instance by default, so you have to start the program either manually from the Start menu or from the Run command with a special "switch". How you do it depends on your program. Acrobat, for example, needs to be run from the Run command as "ACROBAT /N" to open a new instance if one is already open.

    If I have upwards of 10,000 pages to post process, then during the day I will open 3 instances on a 4 core computer and split up the jobs across the 3 instances so I can still use the computer (the CPU runs at 75% leaving 25% for "office use"). At night, I will run 4 instances to max out the computer.
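    Splitting the work between instances can be done by hand, but a short script helps with large jobs. This is a minimal sketch that only divides a file list into contiguous batches, one per instance; the file names are hypothetical, and how each batch is handed to a program depends entirely on the software you use.

```python
import os


def split_into_batches(files: list, instances: int) -> list:
    """Split `files` into `instances` contiguous batches of near-equal size."""
    if instances < 1:
        raise ValueError("need at least one instance")
    size, extra = divmod(len(files), instances)
    batches, start = [], 0
    for i in range(instances):
        end = start + size + (1 if i < extra else 0)  # first `extra` batches get one more
        batches.append(files[start:end])
        start = end
    return batches


# Daytime rule from the post: leave one core free for normal desktop use.
daytime_instances = max(1, (os.cpu_count() or 1) - 1)
print(f"instances to run during the day on this machine: {daytime_instances}")

pages = [f"scan_{n:05d}.tif" for n in range(1, 11)]  # hypothetical file names
for number, batch in enumerate(split_into_batches(pages, 3), start=1):
    print(f"instance {number}: {len(batch)} files, {batch[0]} to {batch[-1]}")
```

    Contiguous batches keep the pages of one document together, which makes the later QC step easier than a round-robin split would.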

    But if I know the post-processing will not take that long, say only a few hours, I won't bother with opening multiple instances; I will simply run a batch and let it go until it is complete. With a dual-core computer, this would allow you to run your post-processing and still use the computer. Most batches will not take that long. Be aware that if you run 2-3 instances on a dual-core computer, it may not be usable as a desktop for active work until the batches finish.

    Another option, whether you run multiple instances or not, is to go into the Windows Task Manager and change the CPU priority of the instances to "Below normal" so your active work takes precedence over the background post-processing.
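    If you start the batch jobs from a script instead, the priority can be lowered at launch rather than in Task Manager. A minimal sketch, assuming Python 3.7 or later; the command shown is only a placeholder, not a real OCR tool.

```python
import os
import subprocess
import sys


def run_low_priority(command: list) -> subprocess.Popen:
    """Start `command` at reduced CPU priority and return the process handle."""
    if sys.platform == "win32":
        # Equivalent to picking "Below normal" in Task Manager.
        return subprocess.Popen(
            command, creationflags=subprocess.BELOW_NORMAL_PRIORITY_CLASS
        )
    # On Unix-like systems, raise the niceness of the child process instead.
    return subprocess.Popen(command, preexec_fn=lambda: os.nice(10))


# Placeholder command: replace it with the real batch program and its arguments.
proc = run_low_priority([sys.executable, "-c", "print('batch done')"])
proc.wait()
```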

    As for speed, the more cores are working, the faster the processing will go. The problem is that if you run a single-threaded app on a dual-core CPU, then buy a comparable quad-core and run the same app in the same manner, it will not go any faster. So the trick is to run your single-threaded app multiple times at the same time to max out your CPU's capabilities.
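    A back-of-the-envelope model of that point, as a sketch only: it assumes each instance keeps exactly one core busy, that instances beyond the core count add nothing, and that disk and memory are never the bottleneck. The 5 seconds per page is an invented figure.

```python
import math


def batch_hours(pages: int, seconds_per_page: float,
                instances: int, cores: int) -> float:
    """Estimated wall-clock hours for single-threaded instances sharing a CPU."""
    busy = min(instances, cores)             # assumption: one core per instance
    pages_per_instance = math.ceil(pages / busy)
    return pages_per_instance * seconds_per_page / 3600


PAGES, SECONDS_PER_PAGE = 10_000, 5.0        # hypothetical workload

for cores, instances in [(2, 1), (4, 1), (2, 2), (4, 4)]:
    hours = batch_hours(PAGES, SECONDS_PER_PAGE, instances, cores)
    print(f"{cores} cores, {instances} instance(s): {hours:.1f} h")
# 2 cores, 1 instance(s): 13.9 h
# 4 cores, 1 instance(s): 13.9 h
# 2 cores, 2 instance(s): 6.9 h
# 4 cores, 4 instance(s): 3.5 h
```

    In this model the quad-core only pays off once you actually run four instances; with a single instance the dual-core and the quad-core finish at the same time.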

    At the end of the post-processing, save the document(s) in your desired format, then QC the batch before deleting the images.

    If you use Acrobat and you run large batches, be ready for problems though! Search for solutions and find more people with the same problems too! Acrobat is a PAIN!


  • Related Question

    High Volume Bulk Photo Scanning - Hardware and Software For The Job?
  • Dave Drager

    Let's say that I have a nearly unlimited budget to purchase a high volume photo scanner and I would like the ability to scan both a) modern prints, which could be done through an ADF, and b) older, more fragile photos which would need to be hand scanned via a flatbed scanner.

    What is the best hardware and software to accomplish this goal, given the following priorities:

    1. Integrity of the original
    2. Speed, as far as how many photos you can scan
    3. Quality, must be at least decent

    I can think of a few solutions, but even with the best of hardware it would require a lot of manual scanning on flatbeds. In this case, what software can handle mass processing of flatbed scans with the least amount of human intervention?


  • Related Answers
  • Seasoned Advice (cooking)

    Kodak s1220! Scans 40 pics per minute up to 8" wide at high resolution. There is also a flatbed component (additional fee). See a demo at http://thememorykeepercoach.blogspot.com.