windows - Batch-OCR many PDFs

2014-07
  • Joe

    This was discussed a year ago here:

    Batch OCR for many PDF files (not already OCRed)?

    Is there any way to batch OCR PDFs that haven't already been OCRed? Here is, I think, the current state of things, dealing with two issues:

    Batch OCR PDFs

    Windows

    • Acrobat – This is the most straightforward OCR engine that will batch OCR. The only problems seem to be that 1) it won't skip files that have already been OCRed and 2) if you throw a bunch of PDFs at it (some old), it crashes. It is a little buggy. It will warn you at each error it runs into (though you can tell the software not to notify you). But again, it dies horribly on certain types of PDFs, so your mileage may vary.

    • ABBYY FineReader (Batch/ScanSnap), OmniPage – These have got to be some of the worst-programmed pieces of software known to man. If you can figure out how to fully automate (no prompting) batch OCR of PDFs, saving with the same name, then please post it here. Every solution I could find failed somewhere: renaming, not fully automated, etc. At best there is a way to do it, but the documentation and programming are so horrible that you'll never find it.

    • ABBYY FineReader Engine, ABBYY Recognition Server – These really are more enterprise solutions. You would probably be better off just getting Acrobat to run over a folder and weeding out the PDFs that give you errors or crash the program than going through the hassle of installing evaluation software (assuming you are a simple end user). They don't seem cost-competitive for the small user.

    • Autobahn DX workstation – The cost of this product is so prohibitive that you could probably buy six copies of Acrobat. Not really an end-user solution. If you're an enterprise setup, this may be worth it for you.

    Linux

    • WatchOCR – no longer developed, and basically impossible to run on modern Ubuntu distros
    • pdfsandwich – no longer developed, and basically impossible to run on modern Ubuntu distros
    • ABBYY LINUX OCR – this should be scriptable, and seems to have some good results:

    http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

    However, like a lot of these other ABBYY products, they charge by the page; again, you might be better off trying to get Acrobat batch OCR to work.

    • Ocrad, GOCR, OCRopus, Tesseract – these may work, but there are a few problems:

      1. For some of these, OCR results are not as good as, say, Acrobat's (see the link above).
      2. None of the programs take a PDF in and output a PDF. You have to write a script that breaks the PDF apart first, runs the OCR program over each page, and then reassembles the pages into a PDF (see the sketch after this list).
      3. Once you do, you may find, as I did, that Tesseract creates an OCR layer that is shifted over, so if you search for the word 'the', you'll get a highlight of the part of the word next to it.
    • Batch DjVu → convert to PDF – haven't looked into it, but it seems like a horribly roundabout solution.
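
    For illustration, here is a minimal sketch of what the split/OCR/reassemble in point 2 might look like, assuming Ghostscript and a reasonably recent Tesseract (3.03 or later, which can write a searchable one-page PDF per image). Filenames are placeholders and there is no error handling:

    #!/bin/bash
    # Sketch: split a PDF into page images, OCR each page into a
    # one-page searchable PDF, then merge the pages back together.
    # Usage: ./ocr-sketch.sh input.pdf output.pdf
    set -e

    input="$1"
    output="$2"
    tmpdir="$(mktemp -d)"

    # 1. break the PDF apart into 300 dpi page images
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

    # 2. OCR each page; the 'pdf' config file makes Tesseract emit a searchable PDF
    for page in "$tmpdir"/page-*.tiff; do
        tesseract "$page" "${page%.tiff}" pdf
    done

    # 3. reassemble the per-page PDFs into a single output file
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

    rm -rf -- "$tmpdir"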

    Online

    • PDFcubed.com – come on, not really a batch solution.
    • ABBYY Cloud OCR – not sure if this is really a batch solution either; in any case, you have to pay by the page, and this could get quite pricey.

    Identifying non-OCRed PDFs

    This is a slightly easier problem that can be solved easily in Linux and much less so in Windows. I was able to write a Perl script using pdffonts to check whether any fonts are reported, and so determine which files have not been OCRed.
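
    For reference, the check itself is simple: pdffonts prints two header lines and then one line per font (with yes/no embedding flags), so a PDF that yields only the headers has no fonts and therefore no text layer. A rough shell version of the same test (just a sketch; the filename is a placeholder):

    # pdffonts prints 2 header lines, then one line per font;
    # headers only means no fonts, hence nothing searchable
    if [ "$(pdffonts "input.pdf" | wc -l)" -le 2 ]; then
        echo "input.pdf has not been OCRed"
    fi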


    Current "solutions"

    1. Use a script to identify non-OCRed PDFs (so you don't rerun over thousands of already-OCRed PDFs), copy them to a temporary directory (retaining the correct directory tree), and then use Acrobat on Windows to run over them, hoping that the smaller batches won't crash.

    2. Use the same script but get one of the Linux OCR tools to work properly, risking OCR quality.

    I think I'm going to try #1; I'm just too worried about the results of the Linux OCR tools (I don't suppose anyone has done a comparison), and breaking the files apart and stitching them together again seems like unnecessary coding if Adobe can actually batch OCR a directory without choking.

    If you want a completely free solution, you'll have to use a script to identify the non-OCRed PDFs (or just rerun over the OCRed ones), and then use one of the Linux tools to try to OCR them. Tesseract seems to have the best results, but again, some of these tools are not well supported in modern versions of Ubuntu. If you can set one up and fix the problem I had where the text layer does not line up with the image layer (with Tesseract), then you would have a pretty workable solution, and once again Linux > Windows.


    Do you have a working solution that fully automates batch OCR of PDFs, skips already-OCRed files, keeps the same names, and produces high quality? If so, I would really appreciate the input.


    Below is a Perl script to move non-OCRed files to a temp directory. I can't guarantee it works and it probably needs to be rewritten, but if someone makes it work (assuming it doesn't) or makes it work better, let me know and I'll post a better version here.

    
    #!/usr/bin/perl
    
    # move non-ocred files to a directory
    # change the variables below: you need a base dir (like /home/joe/), a source directory and an output
    # directory (e.g. books and tempdir)
    # move all your pdfs to the sourcedirectory
    
    use warnings;
    use strict;
    
    # need to install these modules with CPAN or your distros installer (e.g. apt-get)
    use CAM::PDF;
    use File::Find;
    use File::Basename;
    use File::Copy;
    
    #use PDF::OCR2;
    #$PDF::OCR2::CHECK_PDF   = 1;
    #$PDF::OCR2::REPAIR_XREF = 1;
    
    my $basedir = '/your/base/directory';
    my $sourcedirectory  = $basedir.'/books/';
    my @exts       = qw(.pdf);
    my $count      = 0;
    my $outputroot = $basedir.'/tempdir/';
    open( WRITE, '>>', $basedir . '/errors.txt' ) or die "Cannot open error log: $!";
    
    #check file
    #my $pdf = PDF::OCR2->new($basedir.'/tempfile.pdf');
    #print $pdf->page(10)->text;
    
    
    
    find(
        {
            wanted => \&process_file,
    
            #       no_chdir => 1
        },
        $sourcedirectory
    );
    close(WRITE);
    
    sub process_file {
        #must be a file
        if ( -f $_ ) {
            my $file = $_;
            #must be a pdf
            my ( $name, $dir, $ext ) = fileparse( $_, @exts );    # fileparse returns (name, path, suffix)
            if ( $ext eq '.pdf' ) {
                # check whether the PDF has a text layer: pdffonts prints a yes/no flag
                # per font, so no 'yes'/'no' in the output means no fonts, i.e. not OCRed
                my $command = "pdffonts \'$file\'";
                my $output  = `$command`;
                if ( !( $output =~ /yes/ || $output =~ /no/ ) ) {
                    #print "$file - Not OCRed\n";
                    my $currentdir = $File::Find::dir;
                    if ( $currentdir =~ /$sourcedirectory(.+)/ ) {
                        #if directory doesn't exist, create
                        unless ( -d $outputroot . $1 ) {
                            # list form of system avoids shell-quoting problems (e.g. spaces in paths)
                            system( 'mkdir', '-p', $outputroot . $1 );
                        }
                        #copy over file
                        my $fromfile = "$currentdir/$file";
                        my $tofile = "$outputroot$1/$file";
                        print "copy from: $fromfile\n";
                        print "copy to: $tofile\n";
                        copy($fromfile, $tofile) or die "Copy failed: $!";
    #                       `touch $outputroot$1/\'$file\'`;
                    }
                }
    
            }
    
        }
    }
    
  • Answers
  • kiwi

    I too have looked for a way to batch-OCR many PDFs in an automated manner, without much luck. In the end I have come up with a workable solution similar to yours, using Acrobat with a script as follows:

    1. Copy all relevant PDFs to a specific directory.

    2. Remove PDFs already containing text (assuming they have already been OCRd or are already text - not ideal, I know, but good enough for now).

    3. Use AutoHotKey to automatically run Acrobat, select the specific directory, and OCR all documents, appending "-ocr" to their filename.

    4. Move the OCRd PDFs back to their original location, using the presence of a "-ocr.pdf" file to determine whether it was successful.

    It is a bit Heath Robinson, but actually works pretty well.

  • Nikolay

    I believe you need to realize that ABBYY FineReader is an end-user solution designed to provide fast and accurate out-of-the-box OCR.

    Based on my experience, OCR projects have significantly different details each time, and there is no way to create an out-of-the-box solution for every unique case. But I can suggest more professional tools that can do the job for you:

    I was part of the front-end development team for the cloud service mentioned above and can provide more info on it if necessary.

    As for detecting a text layer in a PDF, I can't give any advice on that, because this task is a bit outside of OCR, which is my specialty, so I find your approach of using an external script very reasonable. Maybe you'll find this discussion helpful: http://forum.ocrsdk.com/questions/108/check-if-pdf-is-scanned-image-or-contains-text

  • Sancho

    I have tried ABBYY watch folder and its performance is deplorable. When it isn't crashing (which is often), it is re-encoding the PDF images, which translates to (1) degradation in image quality and (2) inexcusably large file sizes. In addition, the entire image appears to be shrunken in size (the output PDF has larger margins). I'm no expert, but it would seem that the best solution would be to KEEP the original image and simply interlineate the recognized text under the image.

  • Neil Pitman

    You could consider Aquaforest's Autobahn DX: http://www.aquaforest.com/en/autobahn.asp

    It is designed to process batches of PDFs and has a variety of options (e.g. skip or pass through already-OCRed files), as well as options for smart treatment of PDFs that may offer a better result (e.g. if a PDF has some image pages and some text pages, it can OCR just the image pages).


  • Related Question

    ubuntu - How to extract text with OCR from a PDF on Linux?
  • obvio171

    How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.


  • Related Answers
  • Jukka Matilainen

    I have had success with the BSD-licensed Linux port of the Cuneiform OCR system.

    No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).

    While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, which makes it possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

    I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new pdf with the
    # extracted text in hidden layer. Requires cuneiform, hocr2pdf, gs.
    # Usage: ./dwim.sh input.pdf output.pdf
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done
    
    # combine the pages into one PDF
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    
    rm -rf -- "$tmpdir"
    

    Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
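
    If losing the metadata matters, one possible workaround (not part of the script above, just a sketch; filenames are placeholders) is to dump the document info with pdftk from the original and stamp it onto the OCRed result afterwards:

    # copy the Info dictionary (Title, Author, ...) from the original to the OCRed copy
    pdftk original.pdf dump_data output meta.txt
    pdftk ocred.pdf update_info meta.txt output ocred-with-meta.pdf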

  • nagul

    See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils
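
    Basic usage, in case it helps (the -layout flag tries to preserve the page layout; filenames are placeholders):

    pdftotext -layout input.pdf output.txt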
    

    You might also find the pdf toolkit of use.

    There is a full list of PDF software on Wikipedia.

    Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e. I couldn't find a Linux pdf-to-text converter that does OCR):

    • Convert the pdf to an image
    • Scan the image to text using OCR tools

    Convert pdf to image

    • gs: The command below should convert a multipage PDF to individual TIFF files.

      gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

    • ImageMagick utilities: There are other questions on the Super User site about using ImageMagick that might help you do the conversion.

      convert foo.pdf foo.png

    Convert image to text with OCR

    Taken from Wikipedia's list of OCR software

  • Ryan Thompson

    If you can convert the PDF pages to images, then you can use any OCR tool you like on them. I've had the best results with tesseract.
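
    For instance, a minimal page-image-plus-Tesseract loop might look like this (a sketch using pdftoppm from poppler-utils; filenames are placeholders):

    # render each page as a 300 dpi PNG (page-1.png, page-2.png, ...)
    pdftoppm -r 300 -png input.pdf page

    # OCR every page image into page-N.txt
    for img in page-*.png; do
        tesseract "$img" "${img%.png}"
    done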

  • syntaxerror

    Google Docs will now use OCR to convert your uploaded image/PDF documents to text. I have had good success with it.

    They are using the OCR system that is used for the gigantic Google Books project.

    However, it must be noted that only PDFs up to 2 MB in size will be accepted for processing.

  • rlangner

    Try WatchOCR. It is a free, open-source software package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration.

  • scruss

    PDFBeads works well for me. This thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, you need to:

    1. Create an image for every page of the PDF; either of the gs examples above should work
    2. Generate hOCR output for each page; I used tesseract (but note that Cuneiform seems to work better).
    3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, etc.
    4. In the new folder, run

      pdfbeads * > ../Output.pdf
      

    This will put the collated, OCR'd PDF in the parent directory.

  • shootingstars

    Geza Kovacs has made an Ubuntu package that is basically a script using hocr2pdf as Jukka suggested, but it makes things a bit quicker to set up.

    From Geza's Ubuntu forum post with details on the package...

    Adding the repository and installing in Ubuntu

    sudo add-apt-repository ppa:gezakovacs/pdfocr
    sudo apt-get update
    sudo apt-get install pdfocr
    

    Running OCR on a file

    pdfocr -i input.pdf -o output.pdf
    

    GitHub repository for the code: https://github.com/gkovacs/pdfocr/

  • tolima

    Another script using tesseract:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and write the extracted text
    # to a plain-text file. Requires tesseract, gs.
    # Usage: ./pdf2ocr.sh input.pdf output.txt
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -SDEVICE=tiff24nc -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually into plain text
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        tesseract "$base.tiff" $base
    done
    
    # combine the per-page text files into one txt file
    cat "$tmpdir"/page-*.txt > "$output"
    
    rm -rf -- "$tmpdir"