ocr - How to automatically find non-searchable PDFs

2014-07
  • Brian Z

    Suppose I have a directory full of PDFs. In most of them the text is fully searchable, which is how I need them to be. But a few are just image scans, and those need to be OCRed.

    Other than simply running a batch OCR on the entire directory, is there a way to quickly identify which PDFs are image-only and actually need to be OCRed?

    I'm not a programmer, but a Linux-friendly solution would be preferred.

  • Answers
  • Glutanimate

    I'm not sure if this is a 100% solution, but I came up with the following script, which should get you a good part of the way if not the whole way (I have not gone through the spec). Run it with the directory that holds the PDFs as its only argument; subdirectories are searched too.

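    #!/bin/bash
    
    # Expect exactly one argument: the directory to scan.
    if [[ "$#" -ne 1 ]]; then
        echo "Usage: $0 /path/to/PDFDirectory"
        exit 1
    fi
    
    PDFDIRECTORY="$1"
    READ_ERROR=0
    
    # Read NUL-separated paths so file names with spaces or newlines are safe.
    while IFS= read -r -d '' FILE; do
        PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"
        RET_PDFFONTS="$?"
        if [[ "$RET_PDFFONTS" -ne 0 ]]; then
            READ_ERROR=1
            echo "Error while reading $FILE. Skipping..."
            continue
        fi
        # pdffonts prints a two-line header, then one line per font.
        FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
        # -le 0 also covers the case where no header was printed at all.
        if [[ "$FONTS" -le 0 ]]; then
            echo "NOT SEARCHABLE: $FILE"
        else
            echo "SEARCHABLE: $FILE"
        fi
    # -iname also matches upper-case extensions such as .PDF
    done < <(find "$PDFDIRECTORY" -type f -iname '*.pdf' -print0)
    
    echo "Done."
    if [[ "$READ_ERROR" -eq 1 ]]; then
        echo "There were some errors."
    fi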
    

    It works by counting the fonts specified in each PDF. If a file has no fonts, it is assumed to consist only of images. (This might trip up on password-protected files; I don't have any to test against.) If a file mixes searchable text with scanned pages, it is reported as searchable, so this won't catch it, but it should be useful for separating scanned image documents in a PDF container from "real" PDFs.
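
    For the mixed case, one possible complement is to check every page for extractable text instead of counting fonts. This is a different technique from the script above, sketched under the assumption that pdfinfo and pdftotext (both from poppler-utils, like pdffonts) are installed. The function names has_text, page_count and pages_without_text are made up for this illustration.

```shell
# Sketch only: assumes pdfinfo and pdftotext (poppler-utils) are on the PATH.

# has_text: succeeds if stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# page_count: print the page count that pdfinfo reports for the file in $1.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}'
}

# pages_without_text: print the number of every page of $1 from which
# pdftotext extracts nothing but whitespace (likely a bare scan).
pages_without_text() {
    local file total page
    file="$1"
    total="$(page_count "$file")"
    # No page count means pdfinfo could not read the file (e.g. encrypted).
    [ -n "$total" ] || return 1
    page=1
    while [ "$page" -le "$total" ]; do
        # -f and -l restrict extraction to one page; "-" writes to stdout.
        if ! pdftotext -f "$page" -l "$page" "$file" - 2>/dev/null | has_text; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}
```

    Calling pages_without_text on a file prints nothing when every page has text, and prints page numbers when some pages look like bare scans. It runs pdftotext once per page, so it is noticeably slower than the font count on large directories.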

    You can, of course, comment out the branch of the if-then-else block that does not apply if you only want to print the files that are not searchable.
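
    If you would rather leave the script untouched, an alternative is to filter its output. The sketch below assumes the output format shown above (lines beginning with "NOT SEARCHABLE: "). The function name only_unsearchable is hypothetical.

```shell
# only_unsearchable: read the detection script's output on stdin and print
# just the paths of the files it flagged, with the label stripped.
only_unsearchable() {
    sed -n 's/^NOT SEARCHABLE: //p'
}
```

    With the script saved under some name of your choosing, say find-scanned.sh, something like `./find-scanned.sh /path/to/PDFDirectory | only_unsearchable > needs-ocr.txt` would leave one path per line in needs-ocr.txt. File names that contain newlines would be split across lines, which is a limitation of this line-based approach.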


  • Related Question

    Batch OCR for many PDF files (not already OCRed)?
  • Erb

    I use Google Desktop Search (I am on Vista), and not all of the PDF files in my archive folder are indexed. This is expected, because "PDF files that contain scanned images" are not indexed ( http://desktop.google.com/support/bin/answer.py?hl=en&answer=90651 )

    So I would like to OCR the PDF files that have not already been OCRed. My goal: I give the program a folder, and it searches the subfolders on its own for the PDF files that need to be converted into OCRed PDFs.

    Note: In the past, if a PDF file was password protected, I removed the password with another (paid) batch tool: verypdf.com "pwdremover" http://www.verypdf.com/pwdremover/

    Any (not too expensive) ideas?

    I have already tried FineReader 6 Pro (on XP at the time), but it had no batch processor included. I also tried Paperfile (paperfile.net), which uses Tesseract http://code.google.com/p/tesseract-ocr/ , but its OCR only goes from PDF to text, not PDF to PDF! There is also another project, http://code.google.com/p/ocropus/

    Thanks in advance ;)


  • Related Answers
  • Darth Android

    Adobe Acrobat will process a folder of PDFs and, like most Adobe products, it has a 30-day trial.
    The function is located in the 'Document' menu:

    Document > OCR Text Recognition > Recognise text in multiple files using OCR

    from where you can add your folder.

    In Acrobat X the function is available as follows:

    Tools > Recognize Text > In Multiple Files
  • rlangner

    Try WatchOCR. It is a free, open source software package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration. With the right configuration it can be used to create a batch PDF/OCR service for an entire network via SMB shares. Unfortunately it is Linux only, but you could install it on an old server and then your entire organisation could use it.

    If you want to do the same online without installing anything, try PDFCubed.com

  • Brian Z

    Actually, pdfsandwich has been updated within the last year and was not at all difficult for me to install in Linux Mint. The results it gives are inferior to Adobe Acrobat, but it's the only workable solution I've found in Linux so far.
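
    To run pdfsandwich over many files in one go, a loop along these lines might work. It is a rough sketch: it assumes you already have a text file listing the PDFs that need OCR, one path per line, and that pdfsandwich is called with just the input file (as far as I know it then writes its result next to the input with an _ocr suffix). The function name ocr_listed and the OCR_CMD variable are hypothetical, the latter being a hook for swapping in another command.

```shell
# ocr_listed: read PDF paths from stdin, one per line, and run the OCR
# command on each. OCR_CMD is a hypothetical override that defaults to
# pdfsandwich.
ocr_listed() {
    while IFS= read -r pdf; do
        # Skip empty lines in the list.
        [ -n "$pdf" ] || continue
        # </dev/null stops the OCR command from consuming the list on stdin.
        "${OCR_CMD:-pdfsandwich}" "$pdf" </dev/null || echo "OCR failed: $pdf" >&2
    done
}
```

    Usage would be something like `ocr_listed < needs-ocr.txt`. Running it first with `OCR_CMD=echo` prints the files that would be processed without touching them.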