QPDF renders streams as gibberish

I have been trying a variety of programs to make a multilingual PDF (a Hebrew/English dictionary) machine readable. QPDF, like pretty much every other program I tried, renders the text as gibberish. I have set --decode-level=all, to no avail.



What could be the issue here?











Tags: pdf, conversion







asked Sep 16 at 16:07 by Theodcyning

1 Answer

I can't say much without seeing the PDF, but here are some basics:



A PDF contains objects, and some of those objects hold content streams: sequences of operators, in a simplified PostScript-like language, that place glyphs on a page. (You can see the objects by opening the PDF in a text editor; if you first decompress the streams, e.g. with mutool, you can read the streams in a text editor as well.)
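To make that concrete, here is a minimal sketch of what decompressing a stream gives you. The content stream below is hand-written and hypothetical (font name `F1`, three 2-byte glyph codes), and it assumes the stream was stored with the common Flate (zlib) compression; your PDF may use other filters.

```python
import zlib

# A hypothetical, hand-written content stream: select font F1 at 12 pt,
# move to (72, 720), and show three glyphs given as 2-byte codes.
content = b"BT /F1 12 Tf 72 720 Td <000300040005> Tj ET"

# Assumption: the stream is stored Flate-compressed, i.e. with zlib.
compressed = zlib.compress(content)

# Decompressing restores the operators, but the operand of Tj is still
# a string of glyph codes, not Unicode text.
decoded = zlib.decompress(compressed)
print(decoded.decode("ascii"))
```

So a fully decoded stream can still look like gibberish: undoing the compression shows you the drawing operators, while the codes inside `<...>` only make sense together with the font they refer to.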



It's really difficult to convert that back into the original text (I assume that's what you mean by "machine readable"), because any such attempt has to make assumptions about how the application that produced the PDF works. If that application simply places glyphs in the order in which they occur in the original text, you can try to remap glyphs to characters and output the characters in that order.
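The remapping step could look roughly like this. It is only a sketch: it assumes the font carries a ToUnicode CMap with simple `bfchar` entries and uses fixed 2-byte glyph codes, and the CMap excerpt, the codes, and the helper names `parse_bfchar` and `decode_hex_string` are made up for illustration. If your PDF has no such table, or a wrong one, you would have to build the mapping by hand.

```python
import re

# Hypothetical excerpt of a font's ToUnicode CMap: glyph code -> Unicode.
TOUNICODE = """
3 beginbfchar
<0003> <05E9>
<0004> <05DC>
<0005> <05D5>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {glyph_code: text} for the bfchar entries of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, uni in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE and may hold several code units.
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_operand, mapping, code_bytes=2):
    """Map the hex string operand of a Tj operator to text, in stream order."""
    raw = bytes.fromhex(hex_operand)
    codes = [int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes)]
    # Unknown codes become U+FFFD so gaps in the mapping stay visible.
    return "".join(mapping.get(c, "\ufffd") for c in codes)

mapping = parse_bfchar(TOUNICODE)
text = decode_hex_string("000300040005", mapping)
print(text)
```

Note that this outputs characters strictly in the order the glyphs appear in the stream, which is exactly the assumption described above.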



If the producing program did something more complex, for example because the document mixes two languages with different reading directions, such attempts are likely to fail.
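As an illustration of the reading-direction problem, suppose (and this is only an assumption about one possible producer) that Hebrew glyphs were written in visual, left-to-right order. A naive extraction then yields each Hebrew word with its letters reversed. The sketch below undoes that for plain Hebrew letters only; the function name `visual_to_logical` is hypothetical, and a real document would need proper bidirectional handling plus the glyph positions on the page.

```python
import re

# Hebrew letters occupy U+05D0..U+05EA; this sketch ignores vowel points,
# punctuation and digits inside Hebrew text.
HEBREW_RUN = re.compile(r"[\u05D0-\u05EA]+(?: +[\u05D0-\u05EA]+)*")

def visual_to_logical(line):
    """Reverse each run of Hebrew words, leaving left-to-right text as is."""
    return HEBREW_RUN.sub(lambda m: m.group(0)[::-1], line)

# Hypothetical extraction result: the Hebrew word was emitted in visual
# (left-to-right) order, so its letters come out reversed.
visual = "peace: \u05DD\u05D5\u05DC\u05E9"
logical = visual_to_logical(visual)
print(logical)
```

A heuristic like this breaks as soon as the producer orders glyphs differently, which is why a generic tool cannot get it right for every PDF.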



So if you really need it, you'll have to look closely at how your PDF does things and write a custom program to convert it back to text.





































answered Sep 17 at 6:18 by dirkt
                     
