Different output from lynx -dump when run as cron job

For a couple of years now I've been scraping content from a web page containing non-Latin characters, using lynx -dump. A script I wrote saves the page content to a file, modifies it with sed, and sends the result in the body of an e-mail. After switching distros (Ubuntu to Void), however, the script no longer works as expected. I've identified the point of failure, as follows.

When I run the very first part of my script (the lynx -dump URL command, with its output saved to a file) from the command line, all works as expected: the file shows up and contains the non-Latin characters I'm expecting. However, when I automate the process by running that same command as a cron job, the results are different. The expected file does show up, but instead of the non-Latin characters it contains the same text transliterated into Latin characters, which is not what I want. The rest of my script then fails, since it depends on the presence of the non-Latin characters.

So why do the results differ depending on whether I issue the lynx command from the command line or from a cron job? Is the site perhaps doing some sort of detection and serving a transliterated page in one case but not the other? Or is lynx itself transliterating the non-Latin characters into Latin ones? Input will be appreciated.

asked Nov 4 '17 at 15:05 by MJiller (edited Nov 4 '17 at 15:11)

2 Answers

Accepted answer

lynx uses the current locale to determine which charset it can use for displaying pages. This information is probably not available under cron, however, so you need to do something like this:

    lynx -display_charset=UTF-8 -dump http://example.com/some/page.html

(Of course, use your system's charset if it differs from UTF-8.)
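
As a minimal sketch of how the switch could be wired into a script that cron calls: the wrapper below passes -display_charset=UTF-8 explicitly, so the output should no longer depend on whatever locale cron provides. The function name dump_page and the LYNX_BIN variable are illustrative assumptions, not anything lynx defines.

```shell
#!/bin/sh
# Sketch: fetch a page as text with an explicit display charset, so the
# result does not depend on the locale that cron happens to provide.

# LYNX_BIN is a hypothetical override (handy for testing); defaults to lynx.
LYNX_BIN=${LYNX_BIN:-lynx}

# dump_page URL OUTFILE
# Writes the rendered text of URL to OUTFILE, asking lynx for UTF-8 output.
dump_page() {
    "$LYNX_BIN" -display_charset=UTF-8 -dump "$1" > "$2"
}

# Run only when both arguments are given, e.g. from a crontab entry.
if [ "$#" -eq 2 ]; then
    dump_page "$1" "$2"
fi
```

A crontab entry could then call the script with a URL and an output file, for example `0 6 * * * /path/to/dump-page.sh http://example.com/some/page.html /path/to/page.txt` (the script name, paths, and schedule are placeholders).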






• Adding that switch to the cron command did, in fact, resolve the issue, thanks! Though this is the more immediate resolution to the issue I'm seeing, I wonder whether a more fundamental resolution might lie in Thomas Dickey's response. I can't say I really understand how the whole locale thing works and its possible impact on what I'm trying to do, so I need to look into that further. After that I'll revisit this thread and try to decide which answer to mark as the solution.
  – MJiller
  Nov 4 '17 at 16:19

• Both solutions do essentially the same thing. Read the comments in lynx.cfg for CHARACTER_SET and LOCALE_CHARSET. Read man lynx for -display_charset. Also, use wget or curl instead of lynx if you want a "more fundamental resolution".
  – Satō Katsura
  Nov 4 '17 at 17:05

• I did not try fiddling with the lynx options since I figured that, if those do not interfere when I run this script as my user, they also should not interfere when my user's cron job runs. But they obviously do. Rather than providing a solution, I think maybe Thomas Dickey might have been trying to point me toward something along the lines of what's described at logikdev.com/2010/02/02/locale-settings-for-your-cron-job - correct?
  – MJiller
  Nov 4 '17 at 17:44

• Setting locales for cron jobs is a security hazard. There's a reason why they are disabled by default. *shrug*
  – Satō Katsura
  Nov 4 '17 at 18:07

• I prefer feeding lynx the switch mentioned, so I will not be modifying /etc/environment. I looked into the possibility of using wget or curl, btw, but parsing their output to create a basic text file such as what I'm after makes the task far more complex than doing it with lynx. I'll now mark this -display_charset=UTF-8 switch suggestion as the solution to my issue. Thanks again.
  – MJiller
  Nov 4 '17 at 18:17

lynx transliterates using your locale settings as a hint. Under cron, the locale is likely POSIX. I'd investigate that first.

For lynx's configuration, start here:

• Character Sets (topic)
• CHARACTER_SET
• LOCALE_CHARSET

If true, LOCALE_CHARSET overrides CHARACTER_SET: lynx uses the current locale to look up the corresponding MIME name and uses that as the display charset.
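
One way to investigate is to print the locale-related variables from the job itself and compare them with what an interactive shell reports. The following is a rough sketch; report_locale is a made-up helper name, and the charmap line only appears where the locale utility is installed.

```shell
#!/bin/sh
# Sketch: print the locale-related environment a job actually sees.

report_locale() {
    # ${VAR-<unset>} prints the placeholder only when VAR is not set at all.
    printf 'LANG=%s\n' "${LANG-<unset>}"
    printf 'LC_ALL=%s\n' "${LC_ALL-<unset>}"
    printf 'LC_CTYPE=%s\n' "${LC_CTYPE-<unset>}"
    # locale(1) may be missing on minimal systems, so check for it first.
    if command -v locale >/dev/null 2>&1; then
        printf 'charmap=%s\n' "$(locale charmap 2>/dev/null)"
    fi
}

report_locale
```

Running this once from a terminal and once from cron (with the output redirected to a file) should show whether LANG and the LC_* variables are unset or different under cron.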








• Thanks for the tip, Thomas Dickey. I have not tried this yet, but from what I'm reading, adding the line LANG=en_US.UTF-8 to /etc/environment might well address the issue I was experiencing (cron's locale being set to POSIX) with my script and the lynx -dump command.
  – MJiller
  Nov 4 '17 at 17:59
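
A narrower variant of the same idea, sketched here under assumptions, is to set the locale for a single command rather than system-wide. run_with_locale and JOB_LOCALE are hypothetical names; the locale has to be installed on the system, and whether lynx then derives its charset from it depends on LOCALE_CHARSET in lynx.cfg.

```shell
#!/bin/sh
# Sketch: run one command with a UTF-8 locale without touching the
# system-wide environment.

# JOB_LOCALE is an assumed locale name; it must exist on the system
# (`locale -a` lists the installed ones where that utility is available).
JOB_LOCALE=${JOB_LOCALE:-en_US.UTF-8}

# run_with_locale COMMAND [ARGS...]
# Sets LC_ALL only in the environment of the command being run.
run_with_locale() {
    LC_ALL=$JOB_LOCALE "$@"
}

# Example (not executed here):
#   run_with_locale lynx -dump http://example.com/some/page.html > page.txt
```

Because LC_ALL is set only for the one command, other cron jobs keep their default environment.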











Accepted answer: answered Nov 4 '17 at 15:16 by Satō Katsura

Other answer: answered Nov 4 '17 at 15:15 by Thomas Dickey