pdf - How to make an image into a rich formatted document?

08
2014-07
  • Cawas

    So I have a magazine (with a couple of pages) in hand, and I want a resulting file that keeps all the diagrams and images intact but still lets me search for and select text in it.

    This PDF is a small example of the result I wish to have. Even the title is selectable! It does seem to have been created as a PDF rather than scanned, but you get the idea.

    The result file doesn't need to be PDF, although I doubt there is any better format for this. The document needs to be a file (for offline reading) and as cross-platform compatible as possible.

    Is there any (simple) solution to this? If not, how could I at least do the OCR's work manually?

  • Answers
  • peanut_butter

    Edit: @Cawas reports that PDF-XChange Viewer accomplished the task successfully, performing OCR on a PDF and making it searchable.

    For the needs that you listed, PDF is probably the simplest and most cross-platform option. Another, slightly more obscure alternative is the DjVu format, but unlike PDF there is significantly less support for it, especially in terms of OCR.

    A number of free optical character recognition (OCR) programs are available and easy to use. However, if you are looking for a very simple solution, any PDF document that you upload to Google Drive will automatically have OCR performed on it. There are limitations on this, but it should work for short documents.
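
    As a rough illustration of the free-OCR route, here is a minimal sketch that uses the open-source Tesseract engine. Tesseract is not one of the tools named in this answer; it is swapped in as an example. The sketch assumes the `tesseract` command-line tool is installed and on your PATH, and `page-01.png` is a hypothetical scan. Asking Tesseract for `pdf` output produces a PDF that keeps the page image and adds a searchable text layer.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, language="eng"):
    """Return the Tesseract command line that writes <output_base>.pdf."""
    # Form: tesseract IMAGE OUTPUTBASE -l LANG pdf
    # The trailing "pdf" selects searchable-PDF output.
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, language="eng"):
    """Run Tesseract if it is installed; return the PDF path, or None if it is missing."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract not found on PATH, nothing was run
    subprocess.run(build_ocr_command(image_path, output_base, language), check=True)
    return Path(str(output_base) + ".pdf")


if __name__ == "__main__":
    # "page-01.png" is a hypothetical file name used only for illustration.
    print(" ".join(build_ocr_command("page-01.png", "page-01")))
```

    Each scanned page becomes its own PDF this way; merging the pages into one file is a separate step that this sketch does not cover.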


  • Related Question

    pdf - Can I force Acrobat Professional to replace the OCR-ed selectable image with text?
  • rumtscho

    I have a book I want to read onscreen. It is scanned at 200 dpi monochrome, so it is hard to read. (I still don't know what went wrong in the scanner driver; I remember setting it to grayscale, but I cannot afford the time to scan again.) I OCRed it with Acrobat Pro, and it went reasonably well. But the result is either something called "Searchable image" or "ClearScan". I like the fact that the layout is preserved, but the problem is that the text is shown as it was scanned, so it is difficult to read onscreen. Besides, the whole book takes up 70 MB.

    Here you can see what the already recognized text looks like:

    [Image: screenshot of the recognized text]

    I tried other OCR programs, but (besides hogging 100% processor time and memory for 2 min per double page) they all recognized the text, leaving the figures completely out. I don't care that much about the layout and the typography, but the figures are important (I don't need the text labels in the images to be OCRed). And I think that if it were to use ASCII for the text and images for the figures, the size should drop considerably.

    So is there a way to ditch the images of the text and use the OCRed version for reading while keeping the figures in their places? I'd prefer the end result to be a PDF file, but I am open to other formats too. I know I could do it manually by pasting the OCRed text into Word and capturing screenshots of the images, but this is too much work for 520 pages.
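
    To put a rough number on the expected size drop, here is a back-of-the-envelope sketch. The 70 MB and 520 pages come from the question; the figure of 3000 characters per page is purely an assumption for illustration, and the figures themselves would still have to be stored as images on top of this.

```python
SCAN_SIZE_MB = 70        # from the question: size of the scanned book
PAGES = 520              # from the question: number of pages
CHARS_PER_PAGE = 3000    # ASSUMPTION: typical text density, not from the question
BYTES_PER_CHAR = 1       # plain ASCII uses one byte per character


def scan_kb_per_page(total_mb, pages):
    """Average size of one scanned page, in KB."""
    return total_mb * 1024 / pages


def text_kb_per_page(chars_per_page, bytes_per_char=BYTES_PER_CHAR):
    """Size of one page of plain text, in KB."""
    return chars_per_page * bytes_per_char / 1024


def text_mb_for_book(pages, chars_per_page, bytes_per_char=BYTES_PER_CHAR):
    """Size of the whole book as plain text, in MB (figures excluded)."""
    return pages * chars_per_page * bytes_per_char / (1024 * 1024)


if __name__ == "__main__":
    print(f"scanned page: {scan_kb_per_page(SCAN_SIZE_MB, PAGES):.1f} KB")
    print(f"text page:    {text_kb_per_page(CHARS_PER_PAGE):.1f} KB")
    print(f"whole text:   {text_mb_for_book(PAGES, CHARS_PER_PAGE):.2f} MB")
```

    Under these assumptions a scanned page averages roughly 138 KB while its text needs about 3 KB, so the text alone would take around 1.5 MB; the real saving depends on how many figures remain as images.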


  • Related Answers
  • Kees

    In OmniPage 16, 17, and 18, you can do the following (for a better layout):

    • select zone types automatically or by hand
    • adjust the selected zone type (text, picture, table)
    • rotate pages
    • change double pages into single pages
    • export to PDF with and without the original scanned image (clearer, easier to read)

    On demand, the program can also do the following (for better recognition):

    • straighten pages
    • straighten lines

    OmniPage 17 and 18 also straighten curved pages and correct skewed angles in digital camera images (close-ups).

    ABBYY 8, 9, and 10 have the same features but give poorer results for digital camera pictures.

    ABBYY 10 has a great "On screen Reader". With it you can recognise parts of the text on your monitor, or even select text from online books such as Google Books or scribd.com. Turn your monitor vertical and make sure the text is at maximum size.

    Infix works well for cleaning up a recognised PDF exported as "text with pictures". It is an easy way to erase a wrongly selected part of a page that contains no picture, and also to add pages to a PDF or delete pages from it.

    Able2Extract is great for recognising tables. PDF2XL does this too.

    Scan Tailor is a somewhat unfriendly, but free, way to get just the black text out of a scan. If parts of a page are missing, set the individual page sizes again.

    With ABBYY it is also possible to get just the black text and pictures. The saved work files contain B+W TIFF pages. You can copy these elsewhere, delete the thumbnail files or metadata, and combine the TIFFs into a multi-page TIFF or a PDF. This file is bigger than a recognised PDF.

    Photoshop or Paint Shop Pro can help adjust the images of scanned text, either one page at a time or in batch mode.

    PaperPort (not perfect) helps with scanning: it makes text blacker while scanning and can fix text and so on afterwards. It works only on individual pages, but it can combine single pages into one PDF.

    Bookmaker is expensive and older, yet it fixes some page curvature and can erase blackened page edges. The trial has limitations, but a somewhat hidden option lets you export to TIFFs, page by page.

    Changing parameters of scanner software can give better output.

    Taking pictures of a 500-page book would take 1 hour.

    • use a tripod
    • use ISO 100 or 200
    • set the white balance manually using the white paper of the book (or other paper that is "more white")
    • use good light, but not direct sunlight
    • watch for large shadows between pages; turn the book around halfway through when needed
    • do some tests
    • with an SLR, use a higher f-stop such as f/8 or f/11 for better depth of field
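
    To illustrate why a higher f-stop helps with a curved book page, here is a small sketch based on the standard thin-lens depth-of-field approximation. The 50 mm focal length, the 600 mm shooting distance, and the 0.03 mm circle of confusion (a common full-frame convention) are all assumptions chosen for illustration, not values from this answer.

```python
def hyperfocal_mm(focal_mm, f_number, coc_mm=0.03):
    """Hyperfocal distance H = f^2 / (N * c) + f, all lengths in mm."""
    return focal_mm ** 2 / (f_number * coc_mm) + focal_mm


def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return (near, far, total) limits of acceptable sharpness in mm.

    Valid only while the subject is closer than the hyperfocal distance,
    which is always the case when photographing a book on a table.
    """
    h = hyperfocal_mm(focal_mm, f_number, coc_mm)
    near = subject_mm * (h - focal_mm) / (h + subject_mm - 2 * focal_mm)
    far = subject_mm * (h - focal_mm) / (h - subject_mm)
    return near, far, far - near


if __name__ == "__main__":
    # ASSUMPTIONS: 50 mm lens, book 600 mm from the camera.
    for f_number in (2.8, 8, 11):
        near, far, total = depth_of_field_mm(50, f_number, 600)
        print(f"f/{f_number}: sharp from {near:.0f} mm to {far:.0f} mm "
              f"(about {total:.0f} mm deep)")
```

    With these assumed values the sharp zone grows severalfold between f/2.8 and f/11, which is what keeps both the gutter and the outer edge of a curved page in focus. The trade-off is a slower shutter speed at low ISO, which is where the tripod comes in.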