Different output from lynx -dump when run as cron job
For a couple of years now I've been "scraping," using lynx -dump, content from a web page containing non-Latin characters. I save the page content to a file, which I then modify with sed, and send that in the body of an e-mail, all of this happening in a script I created. But I'm finding, after switching distros (Ubuntu to Void), that my script is not working as expected. I've identified the point of failure, as follows.
When I run the very first part of my script (the part containing lynx -dump URL and the file name to which the content is to be saved) from the command line, all works as expected. The file shows up and contains the non-Latin characters I'm expecting. However, when I try to automate the process by specifying that same command as a cron job, the results are different. The expected file does show up, but instead of containing the expected non-Latin characters, it contains the same text transliterated into Latin characters, which is not what I want. What follows in my script fails since it depends on the presence of the non-Latin characters.
So, why these strange results depending on whether I issue the lynx command from the command line as opposed to in a cron job? Perhaps the site is doing some sort of detection and providing a transliterated page in one case but not in the other? Or is lynx itself transliterating the non-Latin characters into Latin ones? Input will be appreciated.
cron lynx
edited Nov 4 '17 at 15:11
asked Nov 4 '17 at 15:05
MJiller
2 Answers
accepted
lynx uses the current locale to determine the charset it can use for showing pages. This information is probably not available from cron, however, so you need to do something like this:
lynx -display_charset=UTF-8 -dump http://example.com/some/page.html
(Of course, use the charset on your system if different from UTF-8.)
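As a minimal sketch of how that switch might sit inside the script that cron runs, assuming a hypothetical helper named dump_page and a script that takes the URL and the output file as its two arguments (neither is from the original post):

```shell
#!/bin/sh
# Sketch of the fetch step of a cron-driven script.
# dump_page is a hypothetical helper name; the URL and output file are
# supplied by the caller.

# Render a page as text with an explicit display charset, so the result
# does not depend on the locale that cron passes to the job.
dump_page() {
    lynx -display_charset=UTF-8 -dump "$1"
}

# When called as: script URL OUTFILE, save the rendered page to OUTFILE.
if [ "$#" -eq 2 ]; then
    dump_page "$1" > "$2"
fi
```

The crontab entry would then call the script with the URL and the target file. Because the charset is fixed on the command line, the same output should result whether the script is started from a terminal or from cron.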
Adding that switch to the cron command did, in fact, resolve the issue--thanks! Though this is the more immediate resolution to the issue I'm seeing, I wonder whether a more fundamental resolution might not lie in Thomas Dickey's response. I can't say I really understand well how the whole locale thing works and its possible impact on what I'm trying to do, so I need to look into that further. After that I'll revisit this thread and try to decide which answer to mark as the solution.
– MJiller
Nov 4 '17 at 16:19
Both solutions do essentially the same thing. Read the comments in lynx.cfg for CHARACTER_SET and LOCALE_CHARSET. Read man lynx for -display_charset. Also, use wget or curl instead of lynx if you want a "more fundamental resolution".
– Satō Katsura
Nov 4 '17 at 17:05
I did not try fiddling with the lynx options since I figured that, if those do not interfere when I run this script as my user, they also should not interfere when my user's cron job runs. But they obviously do. Rather than providing a solution, I think maybe Thomas Dickey might have been trying to point me toward something along the lines of what's described at logikdev.com/2010/02/02/locale-settings-for-your-cron-job - correct?
– MJiller
Nov 4 '17 at 17:44
Setting locales for cron jobs is a security hazard. There's a reason why they are disabled by default. shrug
– Satō Katsura
Nov 4 '17 at 18:07
I prefer feeding to lynx the switch mentioned, so I will not be modifying /etc/environment. I looked into the possibility of using wget or curl, btw, but parsing the output I would get using those to create a basic text file such as what I'm after makes the task far more complex than if done using lynx. I'll now mark this -display_charset=UTF-8 switch suggestion as the solution to my issue. Thanks again.
– MJiller
Nov 4 '17 at 18:17
lynx does transliteration using your locale settings as a hint. Running in cron, it's likely that the locale is POSIX. I'd investigate that first.
For lynx's configuration, start here:
Character Sets (topic)
CHARACTER_SET
LOCALE_CHARSET
LOCALE_CHARSET overrides CHARACTER_SET if true, using the current locale to lookup a MIME name that corresponds, and use that as the display charset.
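One way to investigate the locale, sketched here under the assumption that the standard locale utility is installed (report_locale is a hypothetical helper name), is to print the locale variables and the resulting charset from within the job itself:

```shell
#!/bin/sh
# Report the locale-related variables and the resulting character set
# for whatever environment this script is started in.
report_locale() {
    printf 'LANG=%s LC_ALL=%s LC_CTYPE=%s\n' "${LANG-}" "${LC_ALL-}" "${LC_CTYPE-}"
    # The locale utility may be missing on minimal systems, so check first.
    if command -v locale >/dev/null 2>&1; then
        printf 'charmap=%s\n' "$(locale charmap 2>/dev/null)"
    fi
}

report_locale
```

Running this once from an interactive shell and once from a cron entry, with the output redirected to a file, should show whether the two environments differ. Empty variables and a non-UTF-8 charmap under cron would be consistent with the POSIX locale suspected above.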
Thanks for the tip, Thomas Dickey. I did not try this yet, but from what I'm reading adding the line LANG=en_US.UTF-8 to /etc/environment might well address the issue I was experiencing (cron's locale being set to POSIX) with my script and the lynx -dump command.
– MJiller
Nov 4 '17 at 17:59
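A narrower alternative to editing /etc/environment, offered only as a sketch, is to set the locale for the one command that needs it. The helper name with_utf8_locale is hypothetical, en_US.UTF-8 is the value from the comment above and must exist on the system (locale -a lists what is available), and whether lynx then picks up the charset depends on LOCALE_CHARSET being enabled in lynx.cfg:

```shell
#!/bin/sh
# Run a single command with a UTF-8 locale, leaving the rest of the
# script and the system-wide environment untouched.
with_utf8_locale() {
    env LC_ALL=en_US.UTF-8 "$@"
}

# Hypothetical usage inside the cron script:
# with_utf8_locale lynx -dump http://example.com/some/page.html > page.txt
```

This keeps the change local to the job, which may be preferable given the concern raised about setting locales for cron globally.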
Accepted answer (-display_charset switch): answered Nov 4 '17 at 15:16 by Satō Katsura
Second answer (locale and lynx.cfg settings): answered Nov 4 '17 at 15:15 by Thomas Dickey