encoding - Why does this PDF appear to encode parentheses correctly but doesn't when using pdftotext or copying and pasting?
2014-07
I have several journal articles here:
https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/A897.full.pdf
https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/B1264.full.pdf
https://dl.dropboxusercontent.com/u/3610797/malformed-pdfs/E23.full.pdf
They all encode parentheses (and other characters such as brackets) incorrectly. However, this is only apparent when trying to convert them to text or copy and paste. For example, the first line of the body of the first article should read:
Proton exchange membrane fuel cells (PEMFCs) have received
Instead, when copying and pasting from Acrobat Reader, it gives
Proton exchange membrane fuel cells PEMFCs have received
And when using "Save as text" it gives
Proton exchange membrane fuel cells ^CPEMFCs�
have received
where the open parenthesis is ^C, the ASCII control character 03, and the closing parenthesis is Unicode 65533, the replacement character, followed by a newline.
Similarly, the pdf2txt and pdftotext utilities encode it as
Proton exchange membrane fuel cells 共PEMFCs兲 have received
(Unicode 20849 and 20850) and
Proton exchange membrane fuel cells ͑PEMFCs͒ have received
(Unicode 849 and 850).
There's also Unicode 851 ( ͓), 852 ( ͔), 1003 (ϫ), 1011 (ϳ), 1015 (Ϸ), 8217 (’), 8211 (–), 8722 (−), 64257 (fi), 64258 (fl), and the control character Ctrl-L (ASCII 12) in the pdftotext output. Some of them could be normalized to ASCII pretty easily, but some of them will require manual mapping, I think.
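To make the "manual mapping" idea concrete, here is a minimal Python sketch. Only the parenthesis entries are confirmed by the sample sentence above; the bracket, multiplication-sign, tilde and approximately-equal entries are guesses based on how the glyphs look and would need to be checked against the rendered pages. The names CHAR_MAP and normalize_extracted are made up for illustration.

```python
import unicodedata

# Hypothetical mapping from the code points seen in the extracted text to
# the characters they seem to stand for. Entries marked "assumed" are guesses.
CHAR_MAP = {
    849: "(",        # pdftotext output for "("
    850: ")",        # pdftotext output for ")"
    20849: "(",      # pdf2txt output for "("
    20850: ")",      # pdf2txt output for ")"
    851: "[",        # assumed: opening bracket
    852: "]",        # assumed: closing bracket
    1003: "\u00d7",  # assumed: multiplication sign
    1011: "~",       # assumed: tilde
    1015: "\u2248",  # assumed: approximately-equal sign
    8217: "'",       # right single quotation mark
    8211: "-",       # en dash
    8722: "-",       # minus sign
    12: "\n",        # Ctrl-L (form feed), replaced by a newline here
}


def normalize_extracted(text):
    """Apply the manual mapping, then let NFKC expand ligatures such as fi/fl."""
    text = text.translate(CHAR_MAP)
    return unicodedata.normalize("NFKC", text)


print(normalize_extracted("fuel cells \u0351PEMFCs\u0352 have received"))
```

Note that NFKC also rewrites other compatibility characters (superscript digits, for example), so it may be too aggressive for text that contains formulas; dropping that step and adding 64257 and 64258 to the table by hand is an alternative.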
My questions are:
- What's the best way to fix this? I've seen some similar questions, for example, here: http://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii-characters-in-python-perl-script-giv but I have had difficulties getting them to work.
- Why do different PDF readers and PDF to text utilities give such different results?
Here are the outputs of pdfinfo and pdffonts:
Title:
Subject:
Keywords:
Author:
Creator: XPP
Producer: Acrobat Distiller 6.0.1 (Windows)
CreationDate: Thu Mar 23 12:07:23 2006
ModDate: Sun Nov 4 12:48:02 2012
Tagged: no
Pages: 6
Encrypted: no
Page size: 657 x 855 pts
File size: 266467 bytes
Optimized: no
PDF version: 1.4
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Helvetica                            Type 1            no  no  no      89  0
Helvetica-Oblique                    Type 1            no  no  no     109  0
Helvetica-Bold                       Type 1            no  no  no      88  0
LFNLKJ+Times-Bold                    Type 1C           yes yes no      63  0
LFNLLK+Times-Italic                  Type 1C           yes yes no      64  0
LFNLMK+Times-Roman                   Type 1C           yes yes no      65  0
LFNLML+MathematicalPi-Three          Type 1C           yes yes no      66  0
LFNLMM+MathematicalPi-One            Type 1C           yes yes no      67  0
LFNLMN+Universal-GreekwithMathPi     Type 1C           yes yes no      72  0
For example, I have text in a PDF with paragraph breaks like this:
xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx

xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx
But when I copy the text from the PDF and paste it into Notepad, Word 2007, etc., the output comes without paragraph breaks, like this:
xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx
xxxxx xxxxxx xxxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxx x xxxx xx
xxxx xxxx xxxxxxxxxxx x xxxxxxxx x x xxxxxxxxxxxxxx xxxx xxx
xxxx xxxxxx
How can I preserve paragraph breaks when copying text from a PDF and pasting it into Notepad?
Unfortunately there is no concept of paragraphs (and therefore paragraph breaks) in PDF files.
You could try a Word macro that guesses paragraphs based on line length (the last line of a paragraph is usually shorter than the other lines).
But otherwise, unfortunately, you are out of luck.
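The line-length heuristic mentioned above can be sketched as follows. This uses Python rather than a Word macro, and it is only a rough guess at paragraph boundaries: the function name guess_paragraphs and the short_ratio threshold are invented for this example, and a paragraph whose last line happens to fill the full width will be merged with the next one.

```python
def guess_paragraphs(text, short_ratio=0.8):
    """Split pasted PDF text into paragraphs using line length as a hint.

    A line noticeably shorter than the longest line is assumed to be the
    last line of a paragraph. short_ratio is an arbitrary threshold.
    """
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    longest = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        if len(ln) < short_ratio * longest:
            # Short line: treat it as the end of the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines that never hit a short line form a final paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs


pasted = "\n".join(["x" * 60, "x" * 60, "x" * 11] * 2)
print("\n\n".join(guess_paragraphs(pasted)))
```

Each detected paragraph is joined into a single line, so the result can be pasted into Notepad with a blank line between paragraphs.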