Why is wget not willing to download recursively?

The command
$ wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
only downloads index.html and robots.txt for me, even though there are links in it to further pages in the same directory. For example
<A HREF="viewp.html">Viewpoint specification</A>
Why does wget ignore that?
wget recursive download
edited Dec 31 '15 at 21:37
jimmij
asked Dec 31 '15 at 20:22
foobar
2 Answers
It's generally a mistake in tech to mistake one's own fundamental ignorance for a flaw in the technology one is ignorant of.
I tested this, and found the issue immediately:
wget respects robots.txt unless explicitly told not to.
wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
--2015-12-31 12:29:52-- http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
Resolving www.comp.brad.ac.uk (www.comp.brad.ac.uk)... 143.53.133.30
Connecting to www.comp.brad.ac.uk (www.comp.brad.ac.uk)|143.53.133.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 878 [text/html]
Saving to: ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’
www.comp.brad.ac.uk/research/GI 100%[======================================================>] 878 --.-KB/s in 0s
2015-12-31 12:29:53 (31.9 MB/s) - ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’ saved [878/878]
Loading robots.txt; please ignore errors.
--2015-12-31 12:29:53-- http://www.comp.brad.ac.uk/robots.txt
Reusing existing connection to www.comp.brad.ac.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘www.comp.brad.ac.uk/robots.txt’
www.comp.brad.ac.uk/robots.txt 100%[======================================================>] 26 --.-KB/s in 0s
2015-12-31 12:29:53 (1.02 MB/s) - ‘www.comp.brad.ac.uk/robots.txt’ saved [26/26]
FINISHED --2015-12-31 12:29:53--
As you can see, wget did exactly what you asked of it.
What does the robots.txt say in this case?
cat robots.txt
User-agent: *
Disallow: /
So this site doesn't want robots downloading anything, at least not robots that read and follow robots.txt. Usually this means the site doesn't want to be indexed by search engines. To make wget ignore it, turn the robots setting off:
wget -r -erobots=off http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
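As a rough, offline illustration of what that robots.txt means to a compliant client, the sketch below feeds the same two lines to Python's standard urllib.robotparser. It is not how wget implements the check; the helper name allowed and the "Wget" agent string are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The two lines served by the site's robots.txt, as shown above.
SITE_ROBOTS = "User-agent: *\nDisallow: /\n"

# For contrast: an empty Disallow value permits everything.
OPEN_ROBOTS = "User-agent: *\nDisallow:\n"

LINKED_PAGE = "http://www.comp.brad.ac.uk/research/GIP/tutorials/viewp.html"


def allowed(robots_text, url, user_agent="Wget"):
    """Return True if robots_text permits user_agent to fetch url.

    The agent string "Wget" is an assumption; with "User-agent: *"
    the rule applies to every agent anyway.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(SITE_ROBOTS, LINKED_PAGE))
    print(allowed(OPEN_ROBOTS, LINKED_PAGE))
```

With "Disallow: /" every path on the host is off limits to a client that honours the file, which is why the recursion stopped after index.html.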
Now, if wget is simply too powerful for you to learn, that's fine too, but don't make the error of thinking the flaw is in wget.
There is a risk to doing recursive downloads of a site, however, so it's sometimes best to set limits to avoid grabbing the entire site:
wget -r -erobots=off -l2 -np http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
-l2 means 2 levels max (-l means: level). -np means don't go UP in the tree, just in, from the start page (-np means: no parent).
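To make the two limits concrete, here is a minimal sketch of a depth-limited, no-parent walk over an in-memory link graph. It only models the idea behind -l and -np and is not wget's actual algorithm; the function names, the LINKS table, and every page other than index.html and viewp.html are hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

BASE = "http://www.comp.brad.ac.uk/research/GIP/tutorials/"
START = BASE + "index.html"

# Hypothetical link graph: page URL -> hrefs found on that page.
LINKS = {
    START: ["viewp.html", "../index.html"],
    BASE + "viewp.html": ["deeper/a.html"],
    BASE + "deeper/a.html": ["b.html"],
}


def below_start(start_url, url):
    """True if url is on the same host and inside the start URL's directory."""
    start, cand = urlsplit(start_url), urlsplit(url)
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return cand.netloc == start.netloc and cand.path.startswith(start_dir)


def crawl(start_url, links, max_depth, no_parent=True):
    """Breadth-first walk; returns the URLs that would be fetched, in order."""
    seen = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for href in links.get(page, []):
            url = urljoin(page, href)  # resolve relative links
            if url in seen:
                continue
            if no_parent and not below_start(start_url, url):
                continue  # never climb above the starting directory
            seen.append(url)
            queue.append((url, depth + 1))
    return seen


if __name__ == "__main__":
    for fetched in crawl(START, LINKS, max_depth=2):
        print(fetched)
```

In this model the "../index.html" link is skipped because it sits above the start directory, and b.html is skipped because it is three levels away from the start page.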
It just depends on the target page. Sometimes you want to specify exactly what to get and what not to get: by default a recursive download takes whatever the followed links point at, within the depth limit. The -A option takes a comma-separated list of file name suffixes or patterns to accept, and -R takes a list to reject.
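The accept list can be pictured with a small, simplified model: entries containing wildcard characters are treated as patterns, and all other entries as file name suffixes. This is a sketch under that assumption rather than wget's real matching code; the function name accepted and the file name figure1.pdf are made up for illustration.

```python
from fnmatch import fnmatchcase

WILDCARDS = "*?[]"


def accepted(filename, acclist):
    """Return True if filename matches any entry of a comma-separated list.

    Simplified model: an entry with a wildcard character is matched as a
    shell-style pattern, anything else as a plain suffix.
    """
    for entry in acclist.split(","):
        if any(ch in entry for ch in WILDCARDS):
            if fnmatchcase(filename, entry):
                return True
        elif filename.endswith(entry):
            return True
    return False


if __name__ == "__main__":
    for name in ("viewp.html", "figure1.pdf"):
        print(name, accepted(name, "pdf,png"))
```

A reject list would be the same test with the result inverted.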
By the way, I checked, and my wget, version 1.17, is from 2015. I'm not sure what version you are using. Python, by the way, was I think also created in the 90s, so by your reasoning Python is also junk from the 90s.
I admit that wget --help is quite intense and feature-rich, as is the wget man page, so it's understandable that someone would not want to read it, but there are tons of online tutorials that tell you how to do the most common wget actions.
edited Dec 31 '15 at 20:56
muru
answered Dec 31 '15 at 20:35
Lizardx
Yes it is a flaw, if I say recursive, then it should do just that! Otherwise it is misdocumented! Btw I knew the levels, but it was clear that this has few. I am not a robot.
- foobar
Dec 31 '15 at 20:46
There is a reason we have (user) interfaces (and documentation) for software. Division of labour! One cannot learn every little technical detail! man wget says "Turn on recursive retrieving." and not "Turn on recursive retrieving but stop if robots.txt recommends so." I want to be in charge of my software, not some webmaster, who clearly failed with his robots.txt.
- foobar
Dec 31 '15 at 20:54
Happy new year!
- foobar
Dec 31 '15 at 20:56
This must be one of the best replies ever. :)
answered 11 mins ago
Kaioo