How to get `pdftotext` to output text in a readable encoding?

I converted a PDF file into a text file using pdftotext. For example, the PDF contains the sentence "This is the first study on the functional relevance of" (note the "fi" in "first"); when I process the extracted text through GATE, "first" comes out distorted as "�rst". Likewise, in "proteins were isolated from episomally transfected HEK293EBNA cells and purified by affinity chromatography on a", words containing a character that looks like an f but is not a plain f are distorted as well: "proteins were isolated from episomally transfected hek293ebna cells and puri�ed by af�nity chromatography on a".

How can I get pdftotext to output text in a readable encoding?

Tags: text-processing, pdf

asked Mar 20 '15 at 15:29 by hamid
edited Feb 26 at 14:01 by Jeff Schaller

2 Answers

Note that, in the text you pasted, the "fi" in "first" and the "ffi" in "affinity" are ligatures (multiple characters combined into a single glyph). Presumably, pdftotext emits each of these ligatures as a single character, which the tools you use to read the text do not support.

As a Super User question suggests, try this:

    pdftotext -enc ASCII7 input.pdf output.txt

This should prevent pdftotext from printing ligatures verbatim, forcing it to expand them into ASCII characters.

answered Mar 20 '15 at 15:48 by dhag
edited Mar 20 '17 at 10:18 by Community
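
To check whether ligature characters are really what the extracted text contains (before or after trying `-enc ASCII7`), something along these lines may help. This is only a rough sketch: `find_ligatures` and `sample` are illustrative names, not part of pdftotext or GATE, and it looks only at the Latin ligature code points U+FB00 to U+FB06.

```python
import unicodedata


def find_ligatures(text):
    """Return (index, character, Unicode name) for each Latin ligature found."""
    hits = []
    for i, ch in enumerate(text):
        # U+FB00..U+FB06 are the Latin ligatures (ff, fi, fl, ffi, ffl, ...)
        # in the "Alphabetic Presentation Forms" block.
        if "\ufb00" <= ch <= "\ufb06":
            hits.append((i, ch, unicodedata.name(ch)))
    return hits


# Hypothetical sample mimicking the text from the question.
sample = "puri\ufb01ed by a\ufb03nity chromatography"
for index, char, name in find_ligatures(sample):
    print(index, repr(char), name)
```

If this reports nothing for the real output file, the distortion probably has a different cause, such as the file being read with the wrong encoding.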























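Since I was already converting PDFs to text in Python, I post-process the extracted text with a single call to unicodedata.normalize:

    import unicodedata

    # pdf_text is assumed to already hold the text extracted from the PDF.
    # NFKC normalization expands compatibility characters such as ligatures,
    # e.g. "e\ufb03cient" (U+FB03, the "ffi" ligature) becomes "efficient".
    pdf_text = unicodedata.normalize("NFKC", pdf_text)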






answered Feb 26 at 12:52 by Blaise
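
One thing to be aware of with the NFKC approach: it folds every compatibility character, not only ligatures, so a superscript such as "²" becomes a plain "2". If that matters for your text, a narrower variant could expand just the Latin ligatures. The sketch below is one possible way to do that, not something from the answer above; `LIGATURES`, `expand_ligatures` and `sample` are made-up names for illustration.

```python
import unicodedata

# Map each Latin ligature (U+FB00..U+FB06) to its NFKC expansion,
# e.g. "\ufb01" -> "fi" and "\ufb03" -> "ffi".
LIGATURES = {
    chr(cp): unicodedata.normalize("NFKC", chr(cp))
    for cp in range(0xFB00, 0xFB07)
}


def expand_ligatures(text):
    """Expand only Latin ligatures, leaving all other characters untouched."""
    return text.translate(str.maketrans(LIGATURES))


# Hypothetical sample: an "ffi" ligature plus a superscript two.
sample = "a\ufb03nity, 10\u00b2 cells"
print(unicodedata.normalize("NFKC", sample))  # affinity, 102 cells
print(expand_ligatures(sample))               # affinity, 10² cells
```

Full NFKC is simpler when losing such distinctions is acceptable; the table-based version trades a few extra lines for leaving the rest of the text as extracted.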


























