Replace (OCR) garbled text in PDF?

22
2013-10
  • TheLostOne

    Now and then I come across a PDF which looks fine at first glance, but the underlying text is garbled. Currently I have a PDF where only the headings are garbled, and I would like to know if it's possible to somehow replace the garbled text with the correct one.

    I thought about OCRing the specific pages, but this only works if I convert the page to an image.

    How can I manually correct the underlying text, or re-OCR only specific parts?

  • Answers
  • Dmitri

    In Acrobat Pro: View --> Tools --> Recognize Text

    will bring up the OCR toolbar. From there, use the "OCR Suspects" tool to correct the errors in the PDF.


  • Related Question

    ubuntu - How to extract text with OCR from a PDF on Linux?
  • obvio171

    How do I extract text from a PDF that wasn't built with an index? It's all text, but I can't search or select anything. I'm running Kubuntu, and Okular doesn't have this feature.


  • Related Answers
  • Jukka Matilainen

    I have had success with the BSD-licensed Linux port of Cuneiform OCR system.

    No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP).

    While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good. The nice thing about it is that it can output position information for the OCR text in hOCR format, so that it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

    I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new pdf with the
    # extracted text in hidden layer. Requires cuneiform, hocr2pdf, gs.
    # Usage: ./dwim.sh input.pdf output.pdf
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    
    # extract images of the pages (note: resolution hard-coded)
    gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff
    do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done
    
    # combine the pages into one PDF
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    
    rm -rf -- "$tmpdir"
    

    Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.
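    If the metadata matters, it can be copied across afterwards. Below is a sketch assuming the pdftk command-line tool is installed; the function name and the file names are made up for illustration:

```shell
# copy_pdf_metadata SRC DST OUT
# Dump the Info dictionary (Title, Author, ...) from SRC with pdftk's
# dump_data operation and write it back into DST with update_info,
# saving the result as OUT.
copy_pdf_metadata() {
    pdftk "$1" dump_data output meta.txt
    pdftk "$2" update_info meta.txt output "$3"
    rm -f meta.txt
}
```

    For example, copy_pdf_metadata input.pdf output.pdf output-with-meta.pdf after running the script above.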

  • nagul

    See if pdftotext will work for you. If it's not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils
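    Once installed, usage is a one-liner. Here it is wrapped in a function for clarity (a sketch; the name pdf_to_text and the file names are placeholders). Note that pdftotext only extracts an existing text layer; it does no OCR itself:

```shell
# pdf_to_text IN.pdf OUT.txt
# Dump the PDF's embedded text layer to a plain-text file.
# -layout tries to preserve the original physical layout of the text.
pdf_to_text() {
    pdftotext -layout "$1" "$2"
}
```

    For example: pdf_to_text scan.pdf scan.txt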
    

    You might also find the PDF Toolkit (pdftk) of use.

    There's a full list of PDF software on Wikipedia.

    Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e., I couldn't find a Linux pdf-to-text converter that does OCR):

    • Convert the pdf to an image
    • Scan the image to text using OCR tools

    Convert pdf to image

    • gs: The command below should convert a multi-page PDF to individual TIFF files.

      gs -SDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

    • ImageMagick utilities: There are other questions on the SuperUser site about using ImageMagick that might help you do the conversion.

      convert foo.pdf foo.png
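    Two details worth adding when the images are destined for OCR (a sketch; the function name, the 300 dpi figure, and the file names are my own choices): raise the rasterisation density, since ImageMagick's default of 72 dpi is usually too coarse for OCR, and put a printf-style counter in the output name to get one numbered file per page:

```shell
# pdf_pages_to_png IN.pdf PREFIX
# Rasterise each page of IN.pdf to PREFIX-000.png, PREFIX-001.png, ...
# -density 300 renders at 300 dpi so the OCR engine has enough pixels.
pdf_pages_to_png() {
    convert -density 300 "$1" "$2-%03d.png"
}
```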

    Convert image to text with OCR

    Taken from Wikipedia's list of OCR software.

  • Ryan Thompson

    If you can convert the PDF pages to images, then you can use any OCR tool you like on them. I've had the best results with tesseract.
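    For reference, a minimal tesseract invocation, wrapped in a function (a sketch; ocr_page is a made-up name). The second argument is an output base name, and the trailing hocr config asks for hOCR output with word positions:

```shell
# ocr_page IMAGE BASENAME
# Run tesseract on one page image; the hocr config file makes it write
# BASENAME.hocr (recognised text plus word positions) for that page.
ocr_page() {
    tesseract "$1" "$2" hocr
}
```

    Plain text output is just tesseract page.tif page; newer tesseract versions also accept pdf in place of hocr to emit a searchable PDF directly.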

  • syntaxerror

    Google Docs will now use OCR to convert your uploaded image/PDF documents to text. I have had good success with it.

    They are using the OCR system that is used for the gigantic Google Books project.

    However, it must be noted that only PDFs up to 2 MB in size will be accepted for processing.

  • rlangner

    Try WatchOCR. It is a free, open-source software package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration.

  • scruss

    PDFBeads works well for me. This thread “Convert Scanned Images to a Single PDF File” got me up and running. For a b&w book scan, you need to:

    1. Create an image for every page of the PDF; either of the gs examples above should work
    2. Generate hOCR output for each page; I used tesseract (but note that Cuneiform seems to work better).
    3. Move the images and the hOCR files to a new folder; the filenames must correspond, so file001.tif needs file001.html, file002.tif needs file002.html, etc.
    4. In the new folder, run

      pdfbeads * > ../Output.pdf
      

    This will put the collated, OCR'd PDF in the parent directory.
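    Step 3 is easy to get wrong, so a quick pure-shell check before running pdfbeads can help (a sketch; adjust .tif/.html to whatever extensions your tools actually produced):

```shell
# check_hocr_pairs: for every *.tif in the current directory, report
# any missing .html (hOCR) companion -- the files are paired by name.
check_hocr_pairs() {
    for tif in *.tif; do
        [ -e "${tif%.tif}.html" ] || echo "missing: ${tif%.tif}.html"
    done
}
```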