How to get `pdftotext` to output text in a readable encoding?

I converted a PDF file to a text file using pdftotext. For example, the output contains the sentence "This is the ﬁrst study on the functional relevance of"; notice the "ﬁ" in "ﬁrst". When I process this sentence through GATE, "ﬁrst" comes out distorted as "ï¬�rst". Likewise, in "proteins were isolated from episomally transfected HEK293EBNA cells and puriﬁed by afﬁnity chromatography on a", the words containing a character that looks like "f" but is not a plain "f" are distorted as well: "proteins were isolated from episomally transfected hek293ebna cells and puriï¬�ed by afï¬�nity chromatography on a".

How can I get pdftotext to output text in a readable encoding?

edited Feb 26 at 14:01

Jeff Schaller

43.8k1161141

asked Mar 20 '15 at 15:29

hamid

1612

text-processing pdf


2 Answers

Observe that, in the text you pasted, "fi" in "first" and "ffi" in
"affinity" are ligatures (multiple characters combined into a single
glyph). Presumably, pdftotext prints each of these ligatures as a
single character, which the tools you use to read the text do not support.
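The "ï¬�" in the question is consistent with the three UTF-8 bytes of the single character U+FB01 (the "ﬁ" ligature) being decoded one byte at a time. Here is a minimal Python sketch of that effect. It assumes the reading tool fell back to Latin-1 or Windows-1252; the question does not say which encoding GATE actually used.

```python
# Assumption: the tool reading the file decodes UTF-8 bytes with a legacy
# single-byte encoding. Which one GATE used is not stated in the question.
word = "\ufb01rst"                  # "first" written with the U+FB01 ligature
raw = word.encode("utf-8")          # the bytes a UTF-8 text file would hold

# Latin-1 maps every byte to a character: 'ï', '¬', then control char U+0081.
as_latin1 = raw.decode("latin-1")

# Windows-1252 has no character for byte 0x81, so it becomes U+FFFD here.
as_cp1252 = raw.decode("cp1252", errors="replace")

print(raw)  # b'\xef\xac\x81rst'
```

If this is the cause, telling the reading tool to open the file as UTF-8 would be an alternative to removing the ligatures.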

As a Super User question suggests, try this:

pdftotext -enc ASCII7 input.pdf output.txt

This should prevent pdftotext from printing ligatures verbatim, forcing it to expand them into ASCII characters.
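If the conversion is driven from a script, the same command can be wrapped as sketched below. The helper names `pdftotext_command` and `convert` are hypothetical, and the only flag used is the `-enc ASCII7` shown above.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, encoding="ASCII7"):
    """Build the argument list for the pdftotext call shown above."""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def convert(pdf_path, txt_path, encoding="ASCII7"):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext is not on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, encoding), check=True)


# Only builds the command, so this line runs without pdftotext installed.
print(pdftotext_command("input.pdf", "output.txt"))
```

Calling `convert("input.pdf", "output.txt")` would then write the ASCII-only text. On Poppler builds, `pdftotext -listenc` should list the encoding names accepted by `-enc`.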

edited Mar 20 '17 at 10:18

Community♦

answered Mar 20 '15 at 15:48

dhag

11.5k33246


Since I was already converting PDFs to text in Python, I post-process the extracted text with a single call to the standard library:

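import unicodedata

# NFKC replaces compatibility characters such as the "ﬃ" ligature with
# their plain equivalents: "eﬃcient" -> "efficient".
# pdf_text is the string produced by the earlier PDF-to-text step.
pdf_text = unicodedata.normalize("NFKC", pdf_text)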
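One thing to be aware of is that NFKC rewrites every compatibility character, not just ligatures; a superscript "²" becomes a plain "2", for example. The sketch below compares it with a narrower alternative that only expands the five Latin f-ligatures. The names `LIGATURES` and `expand_ligatures` are hypothetical, and the sample string is made up for illustration.

```python
import unicodedata

# Latin f-ligatures from the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace only the five f-ligatures, leaving other characters alone."""
    return text.translate(LIGATURE_TABLE)


# Illustrative sample: two "fi" ligatures, one "ffi" ligature, a superscript 2.
sample = "puri\ufb01ed by af\ufb01nity, e\ufb03cient, 10\u00b2"

nfkc = unicodedata.normalize("NFKC", sample)
targeted = expand_ligatures(sample)

print(nfkc)             # purified by affinity, efficient, 102
print(ascii(targeted))  # 'purified by affinity, efficient, 10\xb2'
```

Which one fits depends on whether the later processing needs characters such as superscripts preserved.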

answered Feb 26 at 12:52

Blaise

1113
