How to improve quality of PDF text files on Mac

2014-07
  • flow

    I have a 200-page text document in PDF format that I need to edit. I tried OCR, but the quality of the pages is poor and OCR does not work well on them. Can I somehow improve the quality of the PDF file so that OCR works better next time? Is there a way to improve the quality/resolution of PDF text files on Mac?

    PS: Text cannot be copied and pasted from the PDF file; the pages are only images.

  • Answers
  • Jangari

    I doubt it. You could try putting each page through Photoshop, GIMP, or something similar and running some kind of sharpen filter, but honestly, having done this several times, it's just easier to clean up the text of a bad OCR read than it is to clean the input.
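    To illustrate what a sharpen filter does to a scanned page, here is a minimal pure-Python sketch. It assumes the page is already available as a grayscale grid of 0-255 values (a list of rows) and uses a common 3x3 sharpening kernel; the function name `sharpen` is made up for this example, and an image editor's filter may use a different kernel.

```python
def sharpen(pixels):
    """Apply a 3x3 sharpening kernel to a grayscale image.

    `pixels` is a list of rows, each a list of ints in 0..255.
    Border pixels are copied unchanged; this is a simplification.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [row[:] for row in pixels]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Kernel: centre weight 5, the four direct neighbours -1 each.
            value = (
                5 * pixels[y][x]
                - pixels[y - 1][x]
                - pixels[y + 1][x]
                - pixels[y][x - 1]
                - pixels[y][x + 1]
            )
            # Clamp back into the valid grayscale range.
            out[y][x] = max(0, min(255, value))
    return out
```

    Flat areas are left as they are, while pixels next to a dark/light boundary are pushed further apart, which is why sharpening can make blurry glyph edges crisper but also amplifies grain.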

    Play around with different OCR engines to see if they produce different results. I used Tesseract for a job that required me to OCR and clean some 80 pages of text from several different languages, and that seemed to handle it fairly well.
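    For a document that is only page images, one way to try Tesseract from a script is sketched below. It assumes the `pdftoppm` (from Poppler) and `tesseract` command-line tools are installed and on the PATH; the helper names, the 300 dpi setting, and the `page` file prefix are choices made for this example, not anything required by either tool.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Command that renders every PDF page to a PNG named <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Command that OCRs one image; Tesseract writes <output base>.txt."""
    output_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render a PDF to images, OCR each page, and return the text files."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work_dir / "page"), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image, lang), check=True)
    return sorted(work_dir.glob("page-*.txt"))
```

    Calling `ocr_pdf("book.pdf", "ocr_out")` would leave one text file per page in `ocr_out`, which makes it easy to compare the output of different language settings or engines page by page.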

    It's time-consuming, tedious work, but there really is no easier way.


  • Related Question

    automation - Cleaning text in PDFs using OCR... or "There should definitely be a way to do this"
  • tel

    I get a lot of PDFs from other people consisting of scanned old documents. Unfortunately, sometimes the text on the scans, though legible, looks grainy and is hard to read.

    What I've been able to do so far is to extract the text, using OCR, into a Word document. However, since these old documents often have illustrations and intricate formatting, what I'd really like to be able to do is to just remove the old grainy text and substitute it with computer-generated fonts. In other words, I'd like to preserve the PDF and the formatting of its pages to the greatest extent possible while "cleaning" up the text by replacing it with, say, Times New Roman.

    I've been looking online for a few days for a simple, automatable way to perform such a cleanup, and I haven't turned up anything so far. It definitely seems like there should be a way to do this, and it doesn't seem that complicated, but maybe I'm overlooking some aspects of this problem that place it outside of what is currently doable with OCR.

    Any suggestions?


  • Related Answers
  • user230879

    Even Adobe's own software is not good at doing this or making clear how to do it.

    With Adobe Acrobat X, you can create a text layer through the menus (View | Tools | Recognize Text) or by clicking Tools in the toolbar and then Recognize Text in the Tools pane.

    You then have options to perform OCR on the document or find "suspects". The "suspects" are possible OCR results that don't look right (don't spellcheck?). Once you have gone through the suspects, there doesn't seem to be any way to access or edit the text layer again short of redoing the OCR.

    You can choose page ranges to limit OCR (e.g. if you have a multilingual document), but you can't limit it to a selection.

    Given that this is such a useful feature, it's disappointing that Adobe don't make it very user-friendly.

    Edit: Two other possible solutions.

    Adobe Acrobat using ClearScan

    When you perform OCR with Adobe Acrobat you can change the PDF Output Style from the default Searchable Image format to ClearScan. This format will actually change the image as well, replacing characters with outlines derived from the OCR. This would both make your PDF more readable and add a text layer, but it does change the original image.

    Infix PDF Editor

    This program does seem to be able to display the text layer, but it still seems tricky fixing places where Adobe's OCR goes wrong (e.g. lone words in their own positioned para).

    Sadly none of these options are freely available.

  • Astara

    It depends on your exact circumstances (fonts used, diagrams, how much cleanup is needed, and so on), but I have had good results with FineReader Professional Edition. It reads most common image formats (scans, TIFF, JPEG, etc.) and can convert to HTML or Word, among others.

    It's not free, but you didn't say you were looking for free. I had a batch of OCR work some time back, and it did a spectacular OCR job with a low error rate. I don't know about today, but five years back, when I first got it, I tried a few other OCR packages and their text recognition accuracy was generally 'abysmal', though they would advertise it (correctly) as 90%, 95%, or 98%. The problem is that even at 99% you are looking at multiple words to correct per page of text. That was too high for my tolerance level.
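    To put rough numbers on what an advertised accuracy figure means per page, here is a small sketch. It assumes the accuracy is measured per character and that a page holds about 2,000 characters; both figures are assumptions for illustration, and vendors may measure accuracy differently.

```python
def expected_errors_per_page(accuracy, units_per_page):
    """Expected number of misrecognised units (characters or words) per page.

    `accuracy` is a fraction between 0 and 1, e.g. 0.99 for 99%.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * units_per_page


# Hypothetical page size used for illustration only.
CHARS_PER_PAGE = 2000

for accuracy in (0.90, 0.95, 0.98, 0.99):
    errors = expected_errors_per_page(accuracy, CHARS_PER_PAGE)
    print(f"{accuracy:.0%} accurate: about {errors:.0f} wrong characters per page")
```

    Under these assumptions, 99% accuracy still leaves about 20 wrong characters on every page, and 95% leaves about 100, which is why a figure that sounds high can still mean a lot of manual correction.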

    I felt the raw retail price was a bit pricey (I usually like free, so purchased software had better be worth it; I'm fluent in "gninux-ese"), but they have offers (or did when I bought) for upgrading from other software at about 50% of the retail price, which is also about their upgrade price. I bought it when it was at about version 6 or 7, and whenever I had newer projects that required similar work, I bought an upgrade to the then-current version. The last version I purchased was 9.0.

    My only [obscure] beef with it was that it did not recognize Unicode and did not produce Unicode files. They currently support 186 languages (reading from the website; AFAIK, all languages are included in the Professional version), but it saved files in region-encoded character sets or 'code pages' (ibm-cp850, ms-cp1250, iso-8859-1, etc.) instead of UTF-8, which was my preference. I was scanning mixed-alphabet files that I would ultimately be editing in UTF-8.
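    If OCR output arrives in one of those code pages, it can be re-encoded to UTF-8 afterwards. The sketch below uses Python's built-in codecs (`cp850`, `cp1250`, and `iso-8859-1` are standard codec names); the function names are made up for this example, and it assumes you already know which code page each file was saved in, since that cannot be detected reliably from the bytes alone.

```python
def transcode_to_utf8(data, source_encoding):
    """Decode bytes from a legacy code page and re-encode them as UTF-8."""
    text = data.decode(source_encoding)
    return text.encode("utf-8")


def convert_file(src_path, dst_path, source_encoding):
    """Read a file saved in `source_encoding` and write a UTF-8 copy."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(transcode_to_utf8(data, source_encoding))
```

    For example, `convert_file("page.txt", "page.utf8.txt", "cp1250")` would write a UTF-8 copy of a file saved in the Central European Windows code page. This does not help with a single file that mixes alphabets from different code pages, because characters outside the chosen code page were never stored in the first place.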

    Their software does a great job with no training. It can be trained to recognize user-specific letters, though I didn't find that process to be as convenient as I would have liked (but it really wasn't needed for most of what I did, or do).

    With the version I have (9), it has the ability to read things off of a screen capture as well, which is occasionally convenient for programs that don't enable copy/paste.

    They appear to have a try-before-you-buy option now as well. Website: finereader.abbyy.com (Professional product at http://finereader.abbyy.com/professional).