ubuntu - How to extract text with OCR from a PDF on Linux?

22
2013-10
  • obvio171

    How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.

  • Answers
  • Jukka Matilainen

    I have had success with the BSD-licensed Linux port of the Cuneiform OCR system.

    No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).
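
    For reference, the build is (or at least was) a standard CMake affair; here is a rough sketch, assuming you have already unpacked the cuneiform-linux source tarball and have cmake plus the ImageMagick C++ development headers installed:

    # run inside the unpacked cuneiform-linux source directory
    mkdir builddir && cd builddir
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make
    sudo make install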

    While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

    I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new PDF with the
    # extracted text in a hidden layer. Requires cuneiform, hocr2pdf, gs.
    # Usage: ./dwim.sh input.pdf output.pdf
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done
    
    # combine the pages into one PDF
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    
    rm -rf -- "$tmpdir"
    

    Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
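
    If keeping the metadata matters, one possible (untested) workaround, assuming pdftk is installed, is to dump the document-information fields from the original and re-apply them to the OCR'd copy (input.pdf and output.pdf here are the same files named in the script's usage comment):

    # copy the Info dictionary (title, author, etc.) from the original PDF;
    # note this does not cover XMP metadata
    pdftk input.pdf dump_data output metadata.txt
    pdftk output.pdf update_info metadata.txt output output-with-metadata.pdf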

  • nagul

    See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils
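
    Basic usage is something like the following (file names are just placeholders); note that pdftotext can only extract an existing text layer, so it will not help with purely scanned pages (see the edit below):

    # -layout tries to preserve the original physical layout of the text
    pdftotext -layout input.pdf output.txt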
    

    You might also find the PDF Toolkit (pdftk) of use.

    A full list of PDF software is available here on Wikipedia.

    Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e., I couldn't find a Linux PDF-to-text converter that does OCR):

    • Convert the PDF to an image
    • Scan the image to text using OCR tools

    Convert PDF to image

    • gs: The command below should convert a multi-page PDF to individual TIFF files.

      gs -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

    • ImageMagick utilities: There are other questions on the SuperUser site about using ImageMagick that might help you with the conversion.

      convert foo.pdf foo.png
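
      Note that by default ImageMagick rasterizes PDFs at only 72 dpi, which is usually too coarse for OCR; adding a -density option (300 dpi is a common choice) generally gives better results:

      # render at 300 dpi, one numbered PNG per page
      convert -density 300 foo.pdf foo-%04d.png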

    Convert image to text with OCR

    Taken from Wikipedia's list of OCR software:
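
    For example, with Tesseract (one of the engines on that list; the Ubuntu package is tesseract-ocr), a single page image produced by the gs command above could be processed roughly like this:

    sudo apt-get install tesseract-ocr
    # OCR one page; the recognized text ends up in filename_0001.txt
    tesseract filename_0001.tif filename_0001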

  • Ryan Thompson

    If you can convert the PDF pages to images, then you can use any OCR tool you like on them. I've had the best results with tesseract.
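
    A rough sketch of that approach, assuming poppler-utils and tesseract are installed (pdftoppm here is just an alternative to the gs commands shown elsewhere in this thread, and input.pdf is a placeholder name):

    # rasterize each page at 300 dpi; recent poppler zero-pads the page numbers
    pdftoppm -r 300 -png input.pdf page

    # OCR every page image; the text for page N lands in page-N.txt
    for img in page-*.png; do
        tesseract "$img" "${img%.png}"
    done

    # concatenate in page order (relies on the zero-padded numbering)
    cat page-*.txt > all-text.txt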

  • syntaxerror

    Google Docs will now use OCR to convert your uploaded image/PDF documents to text. I have had good success with it.

    They use the same OCR system that powers the gigantic Google Books project.

    Note, however, that only PDFs up to 2 MB in size will be accepted for processing.

  • rlangner

    Try WatchOCR. It is a free, open-source package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration.

  • scruss

    PDFBeads works well for me. This thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, you need to:

    1. Create an image for every page of the PDF; either of the gs examples above should work
    2. Generate hOCR output for each page; I used tesseract (but note that Cuneiform seems to work better).
    3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, and so on.
    4. In the new folder, run

      pdfbeads * > ../Output.pdf
      

    This will put the collated, OCR'd PDF in the parent directory.
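
    For what it's worth, here is a sketch of that whole sequence as a script, assuming gs, tesseract, and pdfbeads are installed and using input.pdf as a placeholder name (depending on the tesseract version, the hOCR output may get an .hocr rather than .html extension, hence the rename):

    #!/bin/bash
    # Rough sketch of the PDFBeads workflow described above.
    set -e

    mkdir book && cd book

    # 1. one bitonal 600 dpi image per page, as in the gs examples above
    gs -sDEVICE=tiffg4 -r600x600 -sOutputFile=file%03d.tif -dNOPAUSE -dBATCH -- ../input.pdf

    # 2. and 3. hOCR output next to each image, with matching base names
    for img in file*.tif; do
        tesseract "$img" "${img%.tif}" hocr
        # some tesseract versions write .hocr instead of .html
        if [ -f "${img%.tif}.hocr" ]; then
            mv "${img%.tif}.hocr" "${img%.tif}.html"
        fi
    done

    # 4. collate everything into a single searchable PDF in the parent directory
    pdfbeads * > ../Output.pdf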


  • Related Question

    How can I extract text from a table in a PDF file?
  • Nathan Fellman

    I am trying to implement an algorithm described in an academic paper, which I have in PDF format. The algorithm includes a table of 256 entries that I want to copy to my implementation. However, I can't seem to copy the table as text that I can manipulate. I can only copy it as an image.

    How can I extract the table easily without typing it in?


  • Related Answers
  • Ivo Flipse

    PDF2Table

    I think it outputs the result as XML.

    The web is full of PDF files: one holds the technical details of a five-megapixel digital camera, another a statistic about an enterprise's income over the last two years, another a brilliant crime novel by Sir Arthur Conan Doyle. The widespread use of this file format raises the question of how to reuse the data in such files. Much has already been done in this area; for example, there are several tools that convert PDF files to other formats.

    My work focuses only on extracting table information from PDF files. I searched for tools that extract basic information from PDF files and found one named pdf2html, which can also return data in XML format. To access this XML output I used the JDOM library.

    I developed several heuristics for table detection and decomposition. These heuristics work quite well on simple tables (without spanning columns or rows) and fairly well on complex tables (with spanning rows or columns).

    Sourceforge link

  • Toby Allen

    Your problem might be that the table was pasted into the PDF as an image by the original author. If this is the case (you can find out by checking whether other text in the document will copy as text), your only options are probably to copy it by hand (hope you can touch-type) or to use the OCR software that comes with scanners.

  • Synetech

    I haven't tried this, but the pdf2table project might help.

  • Matthew Lock

    The non-free application PDF2XL and the free PDF Mechanic can both extract tabular data to CSV and Excel, often perfectly, depending on the exact formatting of the table.

  • Matt Jans

    One option seems to be to save the document (or maybe just the page with the table you want) as an XML file. I just did this in Adobe Acrobat Pro by saving as "XML Spreadsheet 2003." This retained the tabular format in the resulting XML file (viewable in Excel). The only "imperfection" is that it treats each literal row in the table as a row in the Excel file. So if any text breaks across rows (e.g., long names), it will show up as two rows in Excel. For a small table, that's pretty minor cleanup.

    Other than that, it seems like this process could be automated.