Practical OCR solution for converting a large book to a digital format?

2014-07
  • Questioner

    I was over at my grandparents' place this past weekend. My grandmother pulled out this giant (~1400 page) book of her family history going back to 1630 or so. Giant nerd that I am, I thought it would be slick to have all the information stored in a database and available from the web. I can handle all the web programming, regular expressions and whatnot, but what I don't know is the best way to get the text from book to computer.

    I know some kind of OCR will be necessary. From the little research I've done, it seems like my options are:

    1. take a picture of every page with a camera then process the pictures with OCR software
    2. use a scanner to scan each page, then process with OCR software
    3. use some kind of hand held device, like this.

    Does anyone have any ideas about the best way to tackle this problem? I don't want to destroy the book, because as far as I know, it can't be replaced. This is probably the only time I'm ever going to scan a large book, so I don't think I want to spend more than $250 on any kind of device. I don't mind some manual effort here (I realize this will most likely take months), but I'd like to find the most efficient method possible.

    Note about the book: It's only about 20 years old, so it's in pretty good shape. It's monochrome and the pages haven't begun to yellow. Since it is so large though, I worry about possible shadows when the text gets down close to the binding.

  • Answers
  • 8088

    I came across this on Lifehacker quite some time back, and it has been one of my top DIY projects ever since.


    Replace the iPhone with any camera or imaging device, and you get a stack of nice high-res JPEGs ready for you to OCR with any software, even (urks!) MS Office... ;)

    Cheap. Effective. DIY. You can't beat an idea like this.

    EDIT: Comments raised some points about shadows, page curling, etc. These are quite easily resolved by anyone who has literally photocopied library texts.

    Add multiple light sources to illuminate the book and eliminate the shadows.

    Slant the book at 90 degrees so the pages don't curl towards the binding in the middle. This also preserves the binding.

    I'll see if I can give an example and set one up myself.

    EDIT 2 : uploaded sample of how you should hold the book, and also notice the light source from the left.


  • alex

    From what I know, ABBYY makes the best OCR software, but it's not free. You should try the trial version of ABBYY FineReader; it may be enough for your needs.

  • NickSentowski

    You will need to capture the image somehow. Various services exist to do this for you. You will also need someone who is familiar with the content of the text to proofread, as OCR is not perfect yet, especially with anything handwritten.

    Others are discussing your question here: http://ask.metafilter.com/92506/scan-my-books

    Some companies will do this for you:
    http://www.scandexsystems.com/BookScanning2.html
    http://www.kirtas.com/index.php?option=com%5Fcontent&view=article&id=13&Itemid=48
    http://www.ristech.ca/product.html

    Some Free Software: http://download.cnet.com/Image-To-PDF-OCR-Converter-PDF-E-Book-Maker/3000-6675%5F4-10392924.html

  • Xaq Fixx

    For a large project like this, one that matters to you and your family, a DIY Book Scanner may be the way to go. Some designs even sport page turners: http://www.diybookscanner.org/ This one doesn't natively support OCR, but it shoots 600 pages an hour and you can run the images through OCR after the fact: http://hackaday.com/2011/07/18/diy-book-scanner-processes-600-pageshour/

  • Chris Nava

    You may want to see if a university near you has a whole book scanner and then beg/bribe a student to put your book through it.

  • Greg Buehler

    I would recommend a flatbed scanner rigged for book scanning or a whole book scanner as mentioned by Chris.

    If you can, get your images compiled into a TIFF format as that is industry standard when it comes to document management systems.

    For doing OCR, I would recommend Tesseract OCR, as it is the framework Google expanded upon for their books project.
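
    As a rough illustration of the TIFF-plus-Tesseract approach, here is a minimal Python sketch that runs the `tesseract` command-line tool over a folder of page images. The function names and the folder layout are assumptions made for this example, and it presumes the `tesseract` binary is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def plan_ocr_commands(page_dir, out_dir, lang="eng"):
    """Return one tesseract command (as an argument list) per TIFF page.

    plan_ocr_commands and the directory layout are hypothetical names
    chosen for this sketch.
    """
    pages = sorted(
        p for p in Path(page_dir).iterdir()
        if p.suffix.lower() in (".tif", ".tiff")
    )
    commands = []
    for page in pages:
        # tesseract appends ".txt" to the output base name itself
        output_base = Path(out_dir) / page.stem
        commands.append(["tesseract", str(page), str(output_base), "-l", lang])
    return commands


def run_ocr(page_dir, out_dir, lang="eng"):
    """Run tesseract on every page; needs the tesseract binary on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in plan_ocr_commands(page_dir, out_dir, lang):
        # check=True stops the batch on the first page that fails
        subprocess.run(command, check=True)
```

    Naming the scans with zero-padded page numbers (for example `page_0001.tif`) keeps the sorted order identical to the book's page order, which makes later proofreading against the original easier.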

  • 8088

    While it sounds tempting to automate the process, you may want to invest rather more time and work, since this particular book is a personal matter. OCR will do the bulk of it, but you'll have to proofread page by page and compare with the original. Keep in mind that the author's mistakes are part of the deal; do not correct them (create footnotes if you feel so inclined). Take your time and don't put yourself under pressure. Book scanning is donkey work, but thoroughness pays, and you'll end up with a fine digital copy of your family's chronicle. Good luck with your endeavour :)
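
    One way to make the page-by-page proofreading a little faster is to have a script point at the spots most likely to be OCR errors, without changing anything itself. The Python sketch below is only a heuristic under my own assumptions: it flags tokens that mix letters and digits, a typical symptom of l/1 or O/0 confusion. The name `flag_suspicious` is made up for this example, and legitimate tokens such as "2nd" will be flagged too.

```python
import re

# Tokens that mix letters and digits (e.g. "16l2", "Wi11iam") are a common
# sign of OCR confusion between l/1, O/0 and similar pairs.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def flag_suspicious(page_text):
    """Return (line_number, token) pairs worth checking against the book.

    Nothing is corrected here; the list is only a reading aid.
    """
    flagged = []
    for line_number, line in enumerate(page_text.splitlines(), start=1):
        for match in MIXED.finditer(line):
            flagged.append((line_number, match.group()))
    return flagged


sample = "John Smith, born 1630\nWi11iam Smith, born 16l2 in Kent"
print(flag_suspicious(sample))  # [(2, 'Wi11iam'), (2, '16l2')]
```

    Because the script only reports and never rewrites, the author's own mistakes stay untouched, which fits the advice above.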


  • Related Question

    osx - Simple, free OCR software (for OS X)?
  • dbr

    Something I've had an occasional need for, but I've never found an application I liked - OCR

    Basically I want to take a photo/scan of a document and convert it to a text document of some kind (ideally with an option for plain text, perhaps .doc or .pages).

    Requirements:

    • Must have a native (Cocoa) GUI, not under X11
    • Free

    Optional pluses:

    • Doesn't require installation, just drag-app-to-Applications-folder (a lot of the OCR utilities I found required libraries to be installed and such)
    • Support images in scanned documents
    • (Apple-)scriptable
    • Open source

  • Related Answers
  • Ludwig Weinzierl

    When I researched this topic a while ago there wasn't any free software for any platform that produced reasonable quality output.

    The Optical character recognition article at Wikipedia lists the following free OCR applications:

    Of these, I only tried gocr; it has no GUI and the quality of its output is very low.

    I'd suggest going with a commercial product: either ABBYY FineReader or OmniPage, both of which have OS X versions. They are often bundled with scanners, and you can buy them pretty cheaply if you don't need the latest version.

  • Jeremy French

    Tesseract is free, has an OS X port, and a how-to.

  • Slink84

    Never heard of any good free OCR for Mac :] There is GOCR, but it is rather crappy. From the low cost apps I would recommend VelOCRaptor. You can try it out for free.

  • Haddock

    Sorry, but some command line action is involved in this solution...

    If all you need to do is convert a PDF containing scanned pages into text, the following method has given me good results - where GUI based tools such as VelOCRaptor have failed (I'm talking about a 134 page PDF doc with scanned pages).

    All programs in the tool chain are free or come with OSX.

    • With Preview, save the PDF as TIFF, 150dpi. Make sure you have enough disk space to play with.
    • Run the TIFF through Tesseract (install using MacPorts / Fink)
    • Now you have a raw text file, which can be spell checked using any good editor (TextWrangler, TextEdit, etc.)

    Good luck!
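
    Before spell checking the raw text from the last step, it can help to rejoin words that the page layout split across two lines, since a spell checker will otherwise flag both halves. This is a minimal Python sketch of that cleanup, not part of the tool chain above; `rejoin_hyphenated` is a name invented for this example. Note that it will also fuse genuine compounds that happened to break at their hyphen, so review the result.

```python
import re

# A word split as "genea-\nlogy" becomes "genealogy". The line break is kept
# after the rejoined word so line counts still roughly match the page.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)[ \t]*")


def rejoin_hyphenated(text):
    """Rejoin words hyphenated across a line break in OCR output."""
    return HYPHEN_BREAK.sub(r"\1\2\n", text)


raw = "family genea-\nlogy since 1630"
print(repr(rejoin_hyphenated(raw)))  # 'family genealogy\nsince 1630'
```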

  • Ronald Pottol

    How about Evernote? Send them the image and they OCR it for you.

  • dbr

    I have just found something called 'PDF OCR X Community Edition'. It's pretty basic: it just gets plain text out without formatting. However, it works quite well. I used it for scanning German, even though officially it only works with English, and it was OK.

  • Darren Meyer

    I haven't been able to find anything for free, but PDFScanner is cheap at $15, is a native OS X app (Snow Leopard and Lion only, AFAICT), and both scans to an OCR'd PDF and lets you open and OCR an existing PDF. It's the only not-horribly-expensive thing I've found.

  • Troggy

    http://www.thefreecountry.com/utilities/ocr.shtml

    It sounds like if you have Microsoft Office, there is a tool that can convert images into files. Other than that, I am not seeing anything that is free for OS X.

    http://discussions.apple.com/thread.jspa?messageID=9807438 Very recent apple discussion in the support forums

    You did mention that you only want a native Cocoa app. If you could consider some of the Linux builds, you might have some luck, as there are a few options there.

  • Larry Gritz

    I know it's not free, but I've had reasonably good experience using ReadIris, which you can find for around $60. Basically it will take scans, jpegs or several other image formats, or even PDFs containing scanned data, do OCR on them, and write as PDF ("searchable" -- i.e., text and/or the image itself).