Why is wget not willing to download recursively?

The command

$ wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html

only downloads index.html and robots.txt for me, even though index.html links to further pages in the same directory. For example:

<A HREF="viewp.html">Viewpoint specification</A>

Why does wget ignore that?

edited Dec 31 '15 at 21:37 by jimmij

asked Dec 31 '15 at 20:22 by foobar


wget recursive download


2 Answers

It's generally a mistake in tech to mistake one's own fundamental ignorance for a flaw in the technology one is completely ignorant of.

I tested this, and found the issue immediately:

wget respects robots.txt unless explicitly told not to.

wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
--2015-12-31 12:29:52-- http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
Resolving www.comp.brad.ac.uk (www.comp.brad.ac.uk)... 143.53.133.30
Connecting to www.comp.brad.ac.uk (www.comp.brad.ac.uk)|143.53.133.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 878 [text/html]
Saving to: ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’

www.comp.brad.ac.uk/research/GI 100%[======================================================>] 878 --.-KB/s in 0s

2015-12-31 12:29:53 (31.9 MB/s) - ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’ saved [878/878]

Loading robots.txt; please ignore errors.
--2015-12-31 12:29:53-- http://www.comp.brad.ac.uk/robots.txt
Reusing existing connection to www.comp.brad.ac.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘www.comp.brad.ac.uk/robots.txt’

www.comp.brad.ac.uk/robots.txt 100%[======================================================>] 26 --.-KB/s in 0s

2015-12-31 12:29:53 (1.02 MB/s) - ‘www.comp.brad.ac.uk/robots.txt’ saved [26/26]

FINISHED --2015-12-31 12:29:53--

As you can see, wget did exactly what you asked it to do.

What does the robots.txt say in this case?

cat robots.txt
User-agent: *
Disallow: /

So this site doesn't want robots downloading anything, at least not robots that read and honor robots.txt. Usually this means the site doesn't want to be indexed by search engines.
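The same diagnosis can be scripted. What follows is a rough sketch, not a full robots.txt parser: the function name blocks_everything is made up for this example, and it only recognizes the simplest case, a "User-agent: *" group that contains "Disallow: /".

```shell
# Hypothetical helper: succeed if the given robots.txt has a
# "User-agent: *" group containing a blanket "Disallow: /".
# Simplification: every User-agent line is treated as starting a new group.
blocks_everything() {
    awk '
        { sub(/\r$/, ""); line = tolower($0) }                 # tolerate CRLF, ignore case
        line ~ /^user-agent:[ \t]*\*[ \t]*$/ { star = 1; next }
        line ~ /^user-agent:/                { star = 0; next }
        star && line ~ /^disallow:[ \t]*\/[ \t]*$/ { found = 1 }
        END { exit !found }
    ' "$1"
}

# Demo with the same two lines the site serves.
demo=$(mktemp)
printf 'User-agent: *\nDisallow: /\n' > "$demo"
if blocks_everything "$demo"; then
    echo "robots.txt disallows everything for all robots"
fi
rm -f "$demo"
```

After the run shown above, the file to check would be www.comp.brad.ac.uk/robots.txt, which is where wget saved it.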

To download anyway, tell wget to ignore robots.txt:

wget -r -erobots=off http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
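The -e option runs its argument as if it were a line in a .wgetrc file, so the same setting can also live in a wgetrc file instead of on the command line. A minimal sketch, assuming you would rather not touch your real ~/.wgetrc: it writes a throwaway file (the variable name rc is arbitrary) that wget can be pointed at through the WGETRC environment variable. The wget call itself is commented out because it needs network access.

```shell
# Write a throwaway wgetrc so the real ~/.wgetrc is left alone.
rc=$(mktemp)
printf 'robots = off\n' > "$rc"

# Show what the file contains.
cat "$rc"

# To use it for a single run, uncomment the next line:
# WGETRC="$rc" wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
```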

Now, if wget is simply too powerful for you to learn, that's fine too, but don't make the error of thinking the flaw is in wget.

A recursive download can pull in far more than you expect, however, so it's often best to set limits to avoid grabbing the entire site:

wget -r -erobots=off -l2 -np http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html

-l2 limits the recursion to 2 levels (-l stands for level).

-np (no parent) means don't go up in the directory tree; only descend from the start page.
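The same command can be spelled with long options, which some people find easier to read in scripts. A sketch under a few assumptions: --wait=1 is an extra that is not part of the command above (it pauses one second between requests), and the script only prints the command. The last line is commented out so nothing is downloaded until you enable it.

```shell
url='http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html'

# Store the command in the positional parameters so it can be inspected first.
set -- wget --recursive --level=2 --no-parent --wait=1 -e robots=off "$url"

printf '%s\n' "$*"   # print the full command line
# "$@"               # uncomment to actually run it
```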

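What else you need depends on the target page; sometimes you want to specify exactly what to get and what to skip. By default, a recursive wget saves every file it finds linked, not just .html/.htm pages, so images, PDFs, and media files are included as long as they fall within the depth and directory limits. The -A option takes a comma-separated list of file name suffixes or patterns to accept, and -R takes a list to reject.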
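As an illustration of -A, here is a hedged sketch of a small wrapper. The function name fetch_types and the DRY_RUN variable are invented for this example; -A itself takes a comma-separated list of file name suffixes or patterns to accept. With DRY_RUN set, the command is printed instead of run.

```shell
# Hypothetical wrapper: recursive fetch limited to the given suffixes.
#   $1  comma-separated accept list, e.g. 'pdf,png,jpg'
#   $2  start URL
fetch_types() {
    # When DRY_RUN is set and non-empty, "echo" is prepended, so the
    # command is shown rather than executed.
    ${DRY_RUN:+echo} wget -r -l2 -np -e robots=off -A "$1" "$2"
}

# Print the command that would fetch only PDFs and images.
DRY_RUN=1 fetch_types 'pdf,png,jpg' \
    'http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html'
```

Note that with an accept list wget still has to fetch HTML pages to find links; pages that do not match the list are deleted again after they have been parsed.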

By the way, I checked: my wget, version 1.17, is from 2015. I'm not sure which version you are using. Python, I think, was also created in the 90s, so by your reasoning Python is also junk from the 90s.

I admit that wget --help is quite dense and feature-rich, as is the wget man page, so it's understandable that someone would not want to read it, but there are plenty of online tutorials that show how to do the most common wget tasks.

edited Dec 31 '15 at 20:56 by muru

answered Dec 31 '15 at 20:35 by Lizardx

Yes, it is a flaw: if I say recursive, then it should do just that! Otherwise it is misdocumented! Btw, I knew about the levels, but it was clear that this site has few. I am not a robot.
– foobar
Dec 31 '15 at 20:46

There is a reason we have (user) interfaces (and documentation) for software. Division of labour! One cannot learn every little technical detail! man wget says "Turn on recursive retrieving." and not "Turn on recursive retrieving but stop if robots.txt recommends so." I want to be in charge of my software, not some webmaster who clearly failed with his robots.txt.
– foobar
Dec 31 '15 at 20:54

Happy new year!
– foobar
Dec 31 '15 at 20:56


This must be one of the best replies ever. :)

answered 11 mins ago by Kaioo (new contributor)


