encoding - Why does this PDF appear to encode parentheses correctly but doesn't when using pdftotext or copying and pasting?

08
2014-07

bariumbitmap

I have several journal articles here:

https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/A897.full.pdf

https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/B1264.full.pdf

https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/E23.full.pdf

https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/Electrochem.%20Solid-State%20Lett.-2006-Nagao-A105-9.pdf

They all encode parentheses (and other characters such as brackets) incorrectly. However, this only becomes apparent when converting them to text or copying and pasting. For example, the first line of the body of the first article should read:

Proton exchange membrane fuel cells (PEMFCs) have received

Instead, when copying and pasting from Acrobat Reader, it gives

Proton exchange membrane fuel cells PEMFCs have received

And when using "Save as text" it gives

Proton exchange membrane fuel cells ^CPEMFCs�
have received

Here the opening parenthesis is ^C (the ASCII control character with code 3), and the closing parenthesis is Unicode 65533, the replacement character, followed by a newline. Similarly, the pdf2txt and pdftotext utilities give, respectively,

Proton exchange membrane fuel cells 共PEMFCs兲 have received

(Unicode 20849 and 20850) and

Proton exchange membrane fuel cells ͑PEMFCs͒ have received

(Unicode 849 and 850).
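For reference, the code points quoted above can be listed with a short Python script. This is only a sketch of how one might inspect the converted text; the function name `non_ascii_report` and the sample string are made up for illustration and are not part of any of the tools mentioned.

```python
from collections import Counter
import unicodedata


def non_ascii_report(text):
    """Return (code point, Unicode name, count) for every character that is
    non-ASCII or an unexpected control character."""
    counts = Counter(
        ch for ch in text
        if ord(ch) > 126 or (ord(ch) < 32 and ch not in "\n\r\t")
    )
    return [
        # Control characters have no Unicode name, so supply a default.
        (ord(ch), unicodedata.name(ch, "<no name>"), n)
        for ch, n in sorted(counts.items())
    ]


# Hypothetical sample built from the code points seen in the pdftotext output.
sample = "fuel cells " + chr(849) + "PEMFCs" + chr(850) + " have received"
for code, name, n in non_ascii_report(sample):
    print(code, name, n)
```

Running this over the full text of a converted article should show every suspicious character with its count, which makes it easier to decide what needs a manual mapping.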

There are also Unicode 851 ( ͓), 852 ( ͔), 1003 (ϫ), 1011 (ϳ), 1015 (Ϸ), 8217 (’), 8211 (–), 8722 (−), 64257 (ﬁ), 64258 (ﬂ), and the control character Ctrl-L (ASCII 12, form feed) in the pdftotext output. Some of these could be normalized to ASCII fairly easily, but I think others will require manual mapping.
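A minimal sketch of the manual-mapping approach in Python, using `str.translate`. Only the parenthesis mappings (849, 850, 20849, 20850) are confirmed by the examples above. The bracket mappings for 851 and 852 are an assumption that should be checked against the rendered page, and 1003, 1011, and 1015 are deliberately left unmapped because their intended symbols are not established here. The names `MANUAL_MAP` and `clean` are hypothetical.

```python
# Keys are the decimal code points quoted in the question.
MANUAL_MAP = {
    849: "(",      # pdftotext output for "("
    850: ")",      # pdftotext output for ")"
    20849: "(",    # pdf2txt output for "("
    20850: ")",    # pdf2txt output for ")"
    851: "[",      # ASSUMPTION: opening bracket, verify against the PDF
    852: "]",      # ASSUMPTION: closing bracket, verify against the PDF
    8217: "'",     # right single quotation mark
    8211: "-",     # en dash
    8722: "-",     # minus sign
    64257: "fi",   # "fi" ligature
    64258: "fl",   # "fl" ligature
    12: "\n",      # form feed (Ctrl-L), emitted at page breaks
}


def clean(text):
    """Replace the known bad code points; leave everything else untouched."""
    return text.translate(MANUAL_MAP)


if __name__ == "__main__":
    line = "fuel cells " + chr(849) + "PEMFCs" + chr(850) + " have received"
    print(clean(line))
```

Characters missing from the table pass through unchanged, so anything still non-ASCII after cleaning is a candidate for a new entry.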

My questions are:

1. What's the best way to fix this? I've seen some similar questions, for example http://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii-characters-in-python-perl-script-giv, but I have had difficulty getting their answers to work.
2. Why do different PDF readers and PDF-to-text utilities give such different results?

Here are the outputs of pdfinfo and pdffonts:

Title:          
Subject:        
Keywords:       
Author:         
Creator:        XPP
Producer:       Acrobat Distiller 6.0.1 (Windows)
CreationDate:   Thu Mar 23 12:07:23 2006
ModDate:        Sun Nov  4 12:48:02 2012
Tagged:         no
Pages:          6
Encrypted:      no
Page size:      657 x 855 pts
File size:      266467 bytes
Optimized:      no
PDF version:    1.4

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Helvetica                            Type 1            no  no  no      89  0
Helvetica-Oblique                    Type 1            no  no  no     109  0
Helvetica-Bold                       Type 1            no  no  no      88  0
LFNLKJ+Times-Bold                    Type 1C           yes yes no      63  0
LFNLLK+Times-Italic                  Type 1C           yes yes no      64  0
LFNLMK+Times-Roman                   Type 1C           yes yes no      65  0
LFNLML+MathematicalPi-Three          Type 1C           yes yes no      66  0
LFNLMM+MathematicalPi-One            Type 1C           yes yes no      67  0
LFNLMN+Universal-GreekwithMathPi     Type 1C           yes yes no      72  0
