windows - Batch-OCR many PDFs

2014-07
  • Joe

    This was discussed a year ago here:

    Batch OCR for many PDF files (not already OCRed)?

    Is there any way to batch OCR PDFs that haven't already been OCRed? Here is, I think, the current state of things, dealing with two issues:

    Batch OCR PDFs

    Windows

    • Acrobat – This is the most straightforward OCR engine that will batch OCR. The only problems seem to be that 1) it won't skip files that have already been OCRed and 2) if you throw a bunch of PDFs at it (some old), it crashes. It is a little buggy. It will warn you at each error it runs into (though you can tell the software not to notify you). But again, it dies horribly on certain types of PDFs, so your mileage may vary.

    • ABBYY FineReader (Batch/ScanSnap), OmniPage – These have got to be some of the worst-programmed pieces of software known to man. If you can figure out how to fully automate (no prompting) batch OCR of PDFs, saving with the same name, then please post it here. Every solution I could find failed somewhere: renaming, not fully automated, etc. At best there is a way to do it, but the documentation and programming are so horrible that you'll never find it.

    • ABBYY FineReader Engine, ABBYY Recognition Server – These really are more enterprise solutions. You would probably be better off just getting Acrobat to run over a folder and weeding out the PDFs that give you errors or crash the program than going through the hassle of installing evaluation software (assuming you are a simple end user). They don't seem cost-competitive for the small user.

    • Autobahn DX workstation – The cost of this product is so prohibitive that you could probably buy six copies of Acrobat. Not really an end-user solution. If you're an enterprise setup, this may be worth it for you.

    Linux

    • WatchOCR – no longer developed, and basically impossible to run on modern Ubuntu distros
    • pdfsandwich – no longer developed, and basically impossible to run on modern Ubuntu distros
    • ABBYY LINUX OCR – this should be scriptable, and seems to have some good results:

    http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

    However, like a lot of these other ABBYY products, they charge by the page; again, you might be better off trying to get Acrobat batch OCR to work.

    • Ocrad, GOCR, OCRopus, Tesseract – these may work, but there are a few problems:

      1. For some of these, OCR results are not as good as, say, Acrobat's (see the link above).
      2. None of the programs take a PDF in and output a PDF. You have to write a script that breaks the PDF apart first, runs the OCR program over each page, and then reassembles the pages into a PDF (see the sketch after this list).
      3. Once you do, you may find, as I did, that Tesseract creates an OCR layer that is shifted over, so if you search for the word 'the', you'll get a highlight of the part of the word next to it.
    • Batch DjVu → convert to PDF – haven't looked into it, but it seems like a horribly roundabout solution.
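
    For illustration, here is a minimal sketch of what the split/OCR/reassemble in point 2 might look like, assuming Ghostscript and a reasonably recent Tesseract (3.03 or later, which can write a searchable one-page PDF per image). Filenames are placeholders and there is no error handling:

    #!/bin/bash
    # Sketch: split a PDF into page images, OCR each page into a
    # one-page searchable PDF, then merge the pages back together.
    # Usage: ./ocr-sketch.sh input.pdf output.pdf
    set -e

    input="$1"
    output="$2"
    tmpdir="$(mktemp -d)"

    # 1. break the PDF apart into 300 dpi page images
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

    # 2. OCR each page; the 'pdf' config file makes Tesseract emit a searchable PDF
    for page in "$tmpdir"/page-*.tiff; do
        tesseract "$page" "${page%.tiff}" pdf
    done

    # 3. reassemble the per-page PDFs into a single output file
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

    rm -rf -- "$tmpdir"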

    Online

    • PDFcubed.com – come on, not really a batch solution.
    • ABBYY Cloud OCR – not sure if this is really a batch solution either; in any case, you have to pay by the page, and this could get quite pricey.

    Identifying non-OCRed PDFs

    This is a slightly easier problem that can be solved easily in Linux and much less so in Windows. I was able to write a Perl script using pdffonts to check whether any fonts are reported, and so determine which files have not been OCRed.
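
    For reference, the check itself is simple: pdffonts prints two header lines and then one line per font (with yes/no embedding flags), so a PDF that yields only the headers has no fonts and therefore no text layer. A rough shell version of the same test (just a sketch; the filename is a placeholder):

    # pdffonts prints 2 header lines, then one line per font;
    # headers only means no fonts, hence nothing searchable
    if [ "$(pdffonts "input.pdf" | wc -l)" -le 2 ]; then
        echo "input.pdf has not been OCRed"
    fi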


    Current "solutions"

    1. Use a script to identify non-OCRed PDFs (so you don't rerun over thousands of already-OCRed PDFs), copy them to a temporary directory (retaining the correct directory tree), and then use Acrobat on Windows to run over them, hoping that the smaller batches won't crash.

    2. Use the same script but get one of the Linux OCR tools to work properly, risking OCR quality.

    I think I'm going to try #1; I'm just too worried about the results of the Linux OCR tools (I don't suppose anyone has done a comparison), and breaking the files apart and stitching them together again seems like unnecessary coding if Adobe can actually batch OCR a directory without choking.

    If you want a completely free solution, you'll have to use a script to identify the non-OCRed PDFs (or just rerun over the OCRed ones), and then use one of the Linux tools to try to OCR them. Tesseract seems to have the best results, but again, some of these tools are not well supported in modern versions of Ubuntu. If you can set one up and fix the problem I had where the text layer does not line up with the image layer (with Tesseract), then you would have a pretty workable solution, and once again Linux > Windows.


    Do you have a working solution that fully automates batch OCR of PDFs, skips already-OCRed files, keeps the same names, and produces high quality? If so, I would really appreciate the input.


    Below is a Perl script to move non-OCRed files to a temp directory. I can't guarantee it works and it probably needs to be rewritten, but if someone makes it work (assuming it doesn't) or makes it work better, let me know and I'll post a better version here.

    
    #!/usr/bin/perl
    
    # move non-ocred files to a directory
    # change the variables below: you need a base dir (like /home/joe/), a source directory and an output
    # directory (e.g. books and tempdir)
    # move all your pdfs to the sourcedirectory
    
    use warnings;
    use strict;
    
    # need to install these modules with CPAN or your distros installer (e.g. apt-get)
    use CAM::PDF;
    use File::Find;
    use File::Basename;
    use File::Copy;
    
    #use PDF::OCR2;
    #$PDF::OCR2::CHECK_PDF   = 1;
    #$PDF::OCR2::REPAIR_XREF = 1;
    
    my $basedir = '/your/base/directory';
    my $sourcedirectory  = $basedir.'/books/';
    my @exts       = qw(.pdf);
    my $count      = 0;
    my $outputroot = $basedir.'/tempdir/';
    open( WRITE, '>>', $basedir . '/errors.txt' ) or die "Cannot open error log: $!";
    
    #check file
    #my $pdf = PDF::OCR2->new($basedir.'/tempfile.pdf');
    #print $pdf->page(10)->text;
    
    
    
    find(
        {
            wanted => \&process_file,
    
            #       no_chdir => 1
        },
        $sourcedirectory
    );
    close(WRITE);
    
    sub process_file {
        #must be a file
        if ( -f $_ ) {
            my $file = $_;
            #must be a pdf
            my ( $name, $dir, $ext ) = fileparse( $_, @exts );    # fileparse returns (name, path, suffix)
            if ( $ext eq '.pdf' ) {
                # check whether the PDF has a text layer: pdffonts prints a yes/no flag
                # per font, so no 'yes'/'no' in the output means no fonts, i.e. not OCRed
                my $command = "pdffonts \'$file\'";
                my $output  = `$command`;
                if ( !( $output =~ /yes/ || $output =~ /no/ ) ) {
                    #print "$file - Not OCRed\n";
                    my $currentdir = $File::Find::dir;
                    if ( $currentdir =~ /$sourcedirectory(.+)/ ) {
                        #if directory doesn't exist, create
                        unless ( -d $outputroot . $1 ) {
                            # list form of system avoids shell-quoting problems (e.g. spaces in paths)
                            system( 'mkdir', '-p', $outputroot . $1 );
                        }
                        #copy over file
                        my $fromfile = "$currentdir/$file";
                        my $tofile = "$outputroot$1/$file";
                        print "copy from: $fromfile\n";
                        print "copy to: $tofile\n";
                        copy($fromfile, $tofile) or die "Copy failed: $!";
    #                       `touch $outputroot$1/\'$file\'`;
                    }
                }
    
            }
    
        }
    }
    
  • Answers
  • kiwi

    I too have looked for a way to batch-OCR many PDFs in an automated manner, without much luck. In the end I have come up with a workable solution similar to yours, using Acrobat with a script as follows:

    1. Copy all relevant PDFs to a specific directory.

    2. Remove PDFs already containing text (assuming they have already been OCRd or are already text - not ideal, I know, but good enough for now).

    3. Use AutoHotKey to automatically run Acrobat, select the specific directory, and OCR all documents, appending "-ocr" to their filename.

    4. Move the OCRd PDFs back to their original location, using the presence of a "-ocr.pdf" file to determine whether it was successful.

    It is a bit Heath Robinson, but actually works pretty well.

  • Nikolay

    I believe you need to realize that ABBYY FineReader is an end-user solution designed to provide fast and accurate out-of-the-box OCR.

    Based on my experience, OCR projects have significantly different details each time, and there is no way to create an out-of-the-box solution for every unique case. But I can suggest more professional tools that can do the job for you:

    I was part of the front-end development team for the cloud service mentioned above and can provide more info on it if necessary.

    As for detecting a text layer in a PDF, I can't give any advice on that, because this task is a bit outside of OCR, which is my specialty, so I find your approach of using an external script very reasonable. Maybe you'll find this discussion helpful: http://forum.ocrsdk.com/questions/108/check-if-pdf-is-scanned-image-or-contains-text

  • Sancho

    I have tried ABBYY watch folder and its performance is deplorable. When it isn't crashing (which is often), it is re-encoding the PDF images, which translates to (1) degradation in image quality and (2) inexcusably large file sizes. In addition, the entire image appears to be shrunken in size (the output PDF has larger margins). I'm no expert, but it would seem that the best solution would be to KEEP the original image and simply interlineate the recognized text under the image.

  • Neil Pitman

    You could consider Aquaforest's Autobahn DX: http://www.aquaforest.com/en/autobahn.asp

    It is designed to process batches of PDFs and has a variety of options (e.g. skip or pass through already-OCRed files), as well as options for smart treatment of PDFs that may offer a better result (e.g. if a PDF has some image pages and some text pages, it can OCR just the image pages).


  • Related Question

    ubuntu - How to extract text with OCR from a PDF on Linux?
  • obvio171

    How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.


  • Related Answers
  • Jukka Matilainen

    I have had success with the BSD-licensed Linux port of the Cuneiform OCR system.

    No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).

    While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, which makes it possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

    I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new pdf with the
    # extracted text in hidden layer. Requires cuneiform, hocr2pdf, gs.
    # Usage: ./dwim.sh input.pdf output.pdf
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done
    
    # combine the pages into one PDF
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    
    rm -rf -- "$tmpdir"
    

    Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
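
    If losing the metadata matters, one possible workaround (not part of the script above, just a sketch; filenames are placeholders) is to dump the document info with pdftk from the original and stamp it onto the OCRed result afterwards:

    # copy the Info dictionary (Title, Author, ...) from the original to the OCRed copy
    pdftk original.pdf dump_data output meta.txt
    pdftk ocred.pdf update_info meta.txt output ocred-with-meta.pdf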

  • nagul

    See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils
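
    Basic usage, in case it helps (the -layout flag tries to preserve the page layout; filenames are placeholders):

    pdftotext -layout input.pdf output.txt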
    

    You might also find the pdf toolkit of use.

    There is a full list of PDF software on Wikipedia.

    Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e. I couldn't find a Linux pdf-to-text converter that does OCR):

    • Convert the pdf to an image
    • Scan the image to text using OCR tools

    Convert pdf to image

    • gs: The command below should convert a multipage PDF to individual TIFF files.

      gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

    • ImageMagick utilities: There are other questions on the Super User site about using ImageMagick that might help you do the conversion.

      convert foo.pdf foo.png

    Convert image to text with OCR

    Taken from Wikipedia's list of OCR software

  • Ryan Thompson

    If you can convert the PDF pages to images, then you can use any OCR tool you like on them. I've had the best results with tesseract.
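
    For instance, a minimal page-image-plus-Tesseract loop might look like this (a sketch using pdftoppm from poppler-utils; filenames are placeholders):

    # render each page as a 300 dpi PNG (page-1.png, page-2.png, ...)
    pdftoppm -r 300 -png input.pdf page

    # OCR every page image into page-N.txt
    for img in page-*.png; do
        tesseract "$img" "${img%.png}"
    done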

  • syntaxerror

    Google Docs will now use OCR to convert your uploaded image/PDF documents to text. I have had good success with it.

    They are using the OCR system that is used for the gigantic Google Books project.

    However, it must be noted that only PDFs up to 2 MB in size will be accepted for processing.

  • rlangner

    Try WatchOCR. It is a free, open-source software package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration.

  • scruss

    PDFBeads works well for me. This thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, you need to:

    1. Create an image for every page of the PDF; either of the gs examples above should work
    2. Generate hOCR output for each page; I used tesseract (but note that Cuneiform seems to work better).
    3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, etc.
    4. In the new folder, run

      pdfbeads * > ../Output.pdf
      

    This will put the collated, OCR'd PDF in the parent directory.

  • shootingstars

    Geza Kovacs has made an Ubuntu package that is basically a script using hocr2pdf as Jukka suggested, but it makes things a bit quicker to set up.

    From Geza's Ubuntu forum post with details on the package...

    Adding the repository and installing in Ubuntu

    sudo add-apt-repository ppa:gezakovacs/pdfocr
    sudo apt-get update
    sudo apt-get install pdfocr
    

    Running OCR on a file

    pdfocr -i input.pdf -o output.pdf
    

    GitHub repository for the code: https://github.com/gkovacs/pdfocr/

  • tolima

    Another script using tesseract:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and write the extracted text
    # to a plain-text file. Requires tesseract, gs.
    # Usage: ./pdf2ocr.sh input.pdf output.txt
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -SDEVICE=tiff24nc -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually into plain text
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        tesseract "$base.tiff" $base
    done
    
    # combine the per-page text files into one txt file
    cat "$tmpdir"/page-*.txt > "$output"
    
    rm -rf -- "$tmpdir"