windows - Looking for software to rename file name of jpeg scan image of doc to text in the image

08
2014-07

therobyouknow

I'm scanning many A4 paper documents to JPEG using a scanner with an automatic document feeder.

The results are FILE0001.JPG, FILE0002.JPG, etc.

I would like a program that renames each file to text found in the scanned JPEG image itself. Preferably, to determine the title, the program would look for the largest text that appears closest to the top of the image.

I am aware of several commercial and some free OCR applications and would be willing to purchase one if necessary. However, these appear to do more than I need: they convert to PDF and so on, whereas I would prefer to keep it simple and work with the original scanned image.

I would welcome out-of-the-box, easy-to-use programs for Windows XP, Windows 7 or MacOS.

Answers

Julian Knight

I'm afraid that what you are looking for is extremely complex and unlikely to be reliable, even if it could be found.

I think that the best you could hope for would be to make use of either Microsoft OneNote (part of Office) or Evernote (has a free version).

Both are able to OCR images in notes, in the background, leaving any discovered text searchable. I'm not sure whether they would pick up the note title from the text, though; they might if you make sure that no other text is in the note. Give them a go.

Be warned though that OCR even of well-scanned typed or typeset documents is far from reliable and even then, knowing what constitutes a title, though easy for humans to parse, is a very hard task for a computer.

UPDATE: The complexity comes from a number of things. First, the act of OCR'ing an image to text is complex enough for a machine to do. There are so many complexities to language that it is very difficult to pick out meaning from an image even when that image is typeset. Even typeset characters vary massively, especially when scanned, due to scanning limitations, changes of angle, smudged or otherwise damaged source text (e.g. a fold in the paper) and so on. Secondly, what is a title? Obvious, you might think - something of a larger size than "average" towards the top of the page? How does the system work out the average font size? That is itself a significant task, as it needs to "parse" the whole scan. Then there are many combinations of layout - which ones should the machine try to recognise? Take an average business report, for example: it may have several title-like text elements.
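To make the "larger than average, towards the top" idea concrete, here is a minimal Python sketch. It assumes some OCR engine has already produced one bounding box per text line; the Line record, the guess_title function and the size_factor and top_fraction thresholds are hypothetical names and values chosen for illustration, not part of any product mentioned here. It uses the median line height as the "average" and gives up rather than guessing when nothing stands out.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Line:
    """One OCR'ed text line (hypothetical record, filled in by an OCR engine)."""
    text: str
    top: int     # distance of the line's box from the top of the page, in pixels
    height: int  # height of the line's box in pixels, a rough proxy for font size


def guess_title(lines: List[Line], page_height: int,
                size_factor: float = 1.3,
                top_fraction: float = 0.5) -> Optional[str]:
    """Return the tallest line in the upper part of the page, or None."""
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return None
    # "Average" font size: the median is less skewed by one huge heading.
    typical = median(ln.height for ln in candidates)
    large = [ln for ln in candidates
             if ln.height >= size_factor * typical
             and ln.top <= top_fraction * page_height]
    if not large:
        return None  # nothing stands out, so do not guess
    # Prefer taller text; among equally tall lines, prefer the one nearest the top.
    best = max(large, key=lambda ln: (ln.height, -ln.top))
    return best.text.strip()
```

A business report with several title-like elements would still fool this, which is exactly the problem described above; the sketch only shows where the thresholds have to go.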

Each of these bits of processing is going to take significant time on even a modern PC and involve large amounts of data processing: clean the image, straighten the image (recognising edges and "lines" of text), pick out font styles to understand what is text and what isn't, attempt to recognise the text (probably applying spelling and grammar rules), work out the font sizes and average, identify repeating elements (headers/footers) to ignore, try to identify larger text early in the document. Then guess the title, check if it is a valid file name for the platform, change it if not, and ensure the name is unique and unused. Phew!
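The last steps (valid file name, unique and unused) are at least the easy part. A minimal Python sketch, assuming the title has already been guessed somehow; safe_stem, unique_path, the "untitled" fallback and the 80-character limit are hypothetical choices for illustration. The character and name rules follow the Windows restrictions, which are the stricter of the two platforms asked about.

```python
import re
from pathlib import Path

# Device names that Windows refuses as file names, whatever the extension.
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                    *(f"COM{i}" for i in range(1, 10)),
                    *(f"LPT{i}" for i in range(1, 10))}


def safe_stem(title: str, max_len: int = 80) -> str:
    """Turn OCR'ed title text into a file name stem without an extension."""
    # Replace characters Windows forbids in file names, plus control characters.
    stem = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", title)
    # Collapse runs of whitespace; Windows also dislikes trailing dots and spaces.
    stem = re.sub(r"\s+", " ", stem).strip(" .")
    stem = stem[:max_len].rstrip(" .")
    if not stem or stem.upper() in WINDOWS_RESERVED:
        stem = "untitled"  # hypothetical fallback name
    return stem


def unique_path(folder, stem: str, suffix: str = ".jpg") -> Path:
    """Return a path in folder that does not exist yet, adding (2), (3), ..."""
    candidate = Path(folder) / f"{stem}{suffix}"
    n = 2
    while candidate.exists():
        candidate = Path(folder) / f"{stem} ({n}){suffix}"
        n += 1
    return candidate
```

The rename itself would then be something like Path("FILE0001.JPG").rename(unique_path(".", safe_stem(title))), ideally tried on a copy of the scans first.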

At best, most OCR tools aim for around 90% accuracy from standard scans with clean, straight-fed documents. Do you want to have 90% accurately titled documents? That might be OK for you, but would customers of a product put up with it? After all of the development, would the risk be worth it to vendors?

I don't know the answers to these questions. I can see that it could be a great feature but I'm not aware that anyone offers this (I've done a quick check via Google too).

It would be easier if all of your documents have the same layout. Then you could use "zoning", something that most of the better tools offer, and take the appropriate zone as the basis for the file name. This would be more (but not completely) reliable. Perhaps you should check with some of the vendors to see if they are interested in doing this.
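As a rough illustration of zoning, the Python sketch below keeps only the OCR'ed words whose boxes fall entirely inside a fixed rectangle and joins them in reading order. It assumes the OCR tool can report a pixel box for each word; Box, text_in_zone and the zone coordinates are hypothetical and would have to be measured on one of your own scans.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """One OCR'ed word with its position on the page (hypothetical record)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def text_in_zone(boxes: List[Box], zone: Tuple[int, int, int, int]) -> str:
    """Join the words that lie fully inside zone = (left, top, right, bottom)."""
    zl, zt, zr, zb = zone
    inside = [b for b in boxes
              if b.left >= zl and b.top >= zt
              and b.left + b.width <= zr and b.top + b.height <= zb]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda b: (b.top, b.left))
    return " ".join(b.text for b in inside)
```

With a zone covering, say, the top 150 pixels of the page, a heading split into the words "Invoice" and "12345" would come back as "Invoice 12345" and the body text below it would be ignored.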

Related Answers

pelms

You could download the 30 day trial of Adobe Acrobat Pro and use the 'OCR Text Recognition' function ('Document > OCR Text Recognition > Recognise Text Using OCR...'). In the settings dialog, choose 'Searchable Image' as the output style. This will keep the page image but embed the OCR'ed text so the document will be searchable and allow text to be selected, copied and pasted.

After running the OCR you'll need to confirm or correct words that the OCR is unsure about using the 'Find OCR Suspects' functions.

meda beda

The following products were found listed on the Internet, but I haven't used them.

Online OCR

OCR Terminal

OCR Terminal is an online OCR service that performs Optical Character Recognition (OCR) on your scanned images and pdf files and renders them into editable and text searchable documents.

Free OCR

Free-OCR.com is a free online OCR (Optical Character Recognition) tool. You can use this to perform OCR on any image you supply.
This service is free, no registration necessary. We also do not need your email address.
Just upload your image files. Free-OCR takes either a JPG, GIF, TIFF, BMP or PDF (only first page). The only restriction is that the images must not be larger than 2MB, no wider or higher than 5000 pixels and there is a limit of 10 image uploads per hour.

Maestro Recognition Server is commercial, but has an online try-it demo.

Free software

FreeOCR - for images only.

FreeOCR is a scan & OCR program including the free Tesseract OCR engine, also known as a Tesseract GUI. It includes a Windows installer, is very simple to use, and supports multi-page TIFFs and fax documents as well as most image types, including compressed TIFFs which the Tesseract engine on its own cannot read. It now has TWAIN scanning.

pdfsandwich - PDF -> PDF converter.

pdfsandwich is a command line tool for OCR scanned books or journals. It is able to recognize the page layout even for multicolumn text.

Essentially, pdfsandwich is a wrapper script which calls the following binaries: convert, cuneiform, gs, and hocr2pdf. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.

Richard

If you have a Google Account then Google Docs now includes the functionality to upload a PDF file and perform OCR on it.

I've tried it myself and it makes a fair stab at an admittedly well formatted PDF.

The formatting is pretty much destroyed but the text seems to survive.

Jukka Matilainen

Cuneiform + hocr2pdf + Ghostscript: A DIY open-source solution.

I posted an answer outlining a solution involving a version of the now open-source Cuneiform OCR system and hocr2pdf, together with Ghostscript for putting the PDF pages together.

That was specifically for Linux but you can get Cuneiform and Ghostscript for Windows, too. I am not sure about hocr2pdf or an equivalent, though.

jtbandes

Here is a very strange method, which involves letting Google index and OCR it for you on a website, then retrieving it.

rlangner

Try PDFCubed.com. Nothing to install, it is all done online. You can send your documents to be processed via the web, email, or Dropbox. Scanned PDFs and TIFs are converted into searchable text PDFs and can then be retrieved via the web, email, or Dropbox.

DaveParillo

Install Imagemagick. Open a cmd window or terminal:

convert myfile.pdf myfile-%02d.jpg

The output will be 1 jpg file for each page in your pdf, myfile-00.jpg, myfile-01.jpg, etc.

Pass each image through an OCR program. I don't have much experience with this, but there seem to be a lot of choices.

Convert each page of text back into pdf. You could do this again with imagemagick, but there are other ways as well:

convert page-%02d.txt -density 300x300 -compress jpeg final.pdf
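For the middle step (passing each page image through an OCR program), a minimal Python sketch is below. It assumes the Tesseract command-line tool is installed and on the PATH; Tesseract is only one of the possible choices, and ocr_command and ocr_pages are hypothetical helper names. The file pattern matches the myfile-00.jpg naming produced by the first command, so the text files come out as myfile-00.txt and so on.

```python
import subprocess
from pathlib import Path


def ocr_command(image):
    """Build the command line for one page image.

    tesseract takes an input image and an output base name, and adds
    the ".txt" extension to the base name itself.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_pages(folder, pattern="myfile-*.jpg"):
    """Run the OCR command on every page image in folder, in page order."""
    for image in sorted(Path(folder).glob(pattern)):
        # check=True stops at the first page the OCR program fails on.
        subprocess.run(ocr_command(image), check=True)
```

Calling ocr_pages(".") in the folder holding the page images would leave one text file next to each JPEG.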

Xavierjazz

Your request seems to be a complicated solution to the problem, although I may not understand the problem correctly. At any rate:

Why not get a PDF writer that will allow you to enter the data directly on to the pdf page?
