encoding - Why does this PDF display parentheses correctly but lose them when using pdftotext or copying and pasting?

08
2014-07
  • bariumbitmap

    I have several journal articles here:

    https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/A897.full.pdf

    https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/B1264.full.pdf

    https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/E23.full.pdf

    https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/Electrochem.%20Solid-State%20Lett.-2006-Nagao-A105-9.pdf

    They all encode parentheses (and other characters such as brackets) incorrectly. However, this is only apparent when trying to convert them to text or copy and paste. For example, the first line of the body of the first article should read:

    Proton exchange membrane fuel cells (PEMFCs) have received
    

    Instead, when copying and pasting from Acrobat Reader, it gives

    Proton exchange membrane fuel cells PEMFCs have received
    

    And when using "Save as text" it gives

    Proton exchange membrane fuel cells ^CPEMFCs�
    have received 
    

    Here the opening parenthesis is ^C (the ASCII control character with code 3), and the closing parenthesis is Unicode 65533, the replacement character, followed by a newline. Similarly, the pdf2txt and pdftotext utilities encode it as

    Proton exchange membrane fuel cells 共PEMFCs兲 have received
    

    (Unicode 20849 and 20850) and

    Proton exchange membrane fuel cells ͑PEMFCs͒ have received
    

    (Unicode 849 and 850).
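
    A quick way to check code points like the ones quoted above is to list every character outside printable ASCII in the extracted text. This is only a sketch; the helper name `describe_non_ascii` is made up for illustration.

```python
import unicodedata

def describe_non_ascii(text):
    """List each distinct character outside printable ASCII.

    Returns (character, decimal code point, Unicode name) tuples in
    order of first appearance. Newlines and tabs are ignored.
    """
    seen = []
    for ch in text:
        printable_ascii = 32 <= ord(ch) <= 126
        if not printable_ascii and ch not in "\n\t" and ch not in seen:
            seen.append(ch)
    # Control characters have no Unicode name, so supply a default.
    return [(ch, ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in seen]

if __name__ == "__main__":
    sample = "fuel cells \u5171PEMFCs\u5172 have received"
    for ch, code, name in describe_non_ascii(sample):
        print(code, hex(code), name)
```

    Running it over the whole pdftotext output gives a complete list of the characters that need attention.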

    There are also Unicode 851 ( ͓), 852 ( ͔), 1003 (ϫ), 1011 (ϳ), 1015 (Ϸ), 8217 (’), 8211 (–), 8722 (−), 64257 (fi), 64258 (fl), and the control character Ctrl-L (ASCII 12) in the pdftotext output. Some of them could be normalized to ASCII pretty easily, but some of them will require manual mapping, I think.
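
    To illustrate the kind of manual mapping I have in mind, here is a minimal Python sketch. Only the parenthesis mappings (849/850 and 20849/20850) are confirmed by the example line above. The targets for 851, 852, 1003, 1011 and 1015 are guesses that would have to be checked against the rendered pages, and the function name `normalize_pdf_text` is made up for illustration.

```python
import unicodedata

# Code point -> ASCII replacement, applied with str.translate().
MANUAL_MAP = {
    0x0351: "(",   # 849, confirmed by the example line
    0x0352: ")",   # 850, confirmed by the example line
    0x5171: "(",   # 20849, confirmed by the example line
    0x5172: ")",   # 20850, confirmed by the example line
    0x0353: "[",   # 851, ASSUMED to be an opening bracket
    0x0354: "]",   # 852, ASSUMED to be a closing bracket
    0x03EB: "x",   # 1003, ASSUMED to be a multiplication sign
    0x03F3: "~",   # 1011, ASSUMED to be a tilde
    0x03F7: "~",   # 1015, ASSUMED to be "approximately equal"
    0x2019: "'",   # 8217, right single quotation mark
    0x2013: "-",   # 8211, en dash
    0x2212: "-",   # 8722, minus sign
    0x000C: "\n",  # Ctrl-L (form feed), treated here as a line break
}

def normalize_pdf_text(text):
    """Apply the manual map, then let NFKC expand ligatures such as fi and fl."""
    text = text.translate(MANUAL_MAP)
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    print(normalize_pdf_text("fuel cells \u0351PEMFCs\u0352 have received"))
```

    One caveat: 20849 and 20850 are ordinary CJK characters, so a blanket mapping like this would corrupt any document that really contains them.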

    My questions are:

    1. What's the best way to fix this? I've seen some similar questions, for example, here: http://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii-characters-in-python-perl-script-giv but I have had difficulties getting them to work.
    2. Why do different PDF readers and PDF to text utilities give such different results?

    Here are the outputs of pdfinfo and pdffonts:

    Title:          
    Subject:        
    Keywords:       
    Author:         
    Creator:        XPP
    Producer:       Acrobat Distiller 6.0.1 (Windows)
    CreationDate:   Thu Mar 23 12:07:23 2006
    ModDate:        Sun Nov  4 12:48:02 2012
    Tagged:         no
    Pages:          6
    Encrypted:      no
    Page size:      657 x 855 pts
    File size:      266467 bytes
    Optimized:      no
    PDF version:    1.4
    
    name                                 type              emb sub uni object ID
    ------------------------------------ ----------------- --- --- --- ---------
    Helvetica                            Type 1            no  no  no      89  0
    Helvetica-Oblique                    Type 1            no  no  no     109  0
    Helvetica-Bold                       Type 1            no  no  no      88  0
    LFNLKJ+Times-Bold                    Type 1C           yes yes no      63  0
    LFNLLK+Times-Italic                  Type 1C           yes yes no      64  0
    LFNLMK+Times-Roman                   Type 1C           yes yes no      65  0
    LFNLML+MathematicalPi-Three          Type 1C           yes yes no      66  0
    LFNLMM+MathematicalPi-One            Type 1C           yes yes no      67  0
    LFNLMN+Universal-GreekwithMathPi     Type 1C           yes yes no      72  0
    
  • Answers

    Related Question

    windows xp - How to preserve paragraph breaks when copying text from a PDF and pasting it into Notepad?
  • metal gear solid

    For example, I copy text from a PDF that has paragraph breaks like this:

    xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
    xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
    xxxx xxxxxx
    
    xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
    xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
    xxxx xxxxxx
    

    But when I paste the copied text into Notepad, Word 2007, etc., the output comes without paragraph breaks.

    Like this:

    xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
    xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
    xxxx xxxxxx
    xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
    xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
    xxxx xxxxxx
    

    How can I preserve paragraph breaks when copying text from a PDF and pasting it into Notepad?


  • Related Answers
  • Ghostrider

    Unfortunately there is no concept of paragraphs (and therefore paragraph breaks) in PDF files.

    You could try a Word macro that guesses paragraphs based on line length (the last line of a paragraph is usually shorter than the other lines).
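
    The same line-length guess can be sketched outside Word. The Python below is an illustration only, not a Word macro: the function name `guess_paragraphs` and the 0.8 threshold are arbitrary choices, and the heuristic will misfire whenever a paragraph's last line happens to be full width.

```python
def guess_paragraphs(text, ratio=0.8):
    """Rebuild paragraphs from pasted lines using line length alone.

    A line noticeably shorter than the longest line (shorter than
    ratio * longest) is assumed to be the last line of a paragraph.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    width = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        if len(ln) < ratio * width:
            # Short line: close the current paragraph here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines that never hit a short line form a final paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs

if __name__ == "__main__":
    pasted = "\n".join(["x" * 60, "x" * 60, "x" * 11] * 2)
    print("\n\n".join(guess_paragraphs(pasted)))
```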

    But otherwise, unfortunately, you are out of luck.