robots.txt is redirecting to default page
Hullo,
Typically, if I type "oneofmysites.com/robots.txt" into my address bar, the browser displays the contents of robots.txt. This is pretty standard behaviour.
I have just one web server that does not. Instead, robots.txt redirects to the default web page (i.e. "thesiteinquestion.com/"). This difference (only one of seven sites) worries me.
Questions: Is this something to be concerned about? If so, what is the likely error that I am missing?
Notes:
- This site is the only one I use with a separate service provider.
- CentOS release 6.10 (Final)
- Webmin
- robots.txt file permissions are 644
Tags: redirect, robots.txt
asked Feb 6 at 21:34 by Parapluie
3 Answers
It depends on the server configuration; .txt files may not be allowed. There may be a rule somewhere in the config, or in an .htaccess file, saying that if a URL doesn't match a certain pattern (say .html, .php, .htm, etc.) the request is redirected to the index page of the web root.
answered Feb 6 at 22:09 by Serge Rivest
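For illustration, a catch-all rewrite of the kind described in this answer might look like the following in .htaccess (a hypothetical sketch; the extensions and target are placeholders, not taken from the actual configuration):

# Hypothetical sketch: any request that does not end in .html, .htm or .php
# is rewritten to the index page, which would also catch robots.txt
RewriteEngine On
RewriteCond %{REQUEST_URI} !\.(html?|php)$ [NC]
RewriteRule ^ /index.php [L]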
Well blue blistering barnacles! You are right. And I did it to myself with this rewrite:
RewriteRule .(gif|jpg|js|txt)$ https://www.thesiteinquestion.com/index.php [L]
I did this to prevent direct access, but I forgot that I had added txt files as well. Comment it out, and it works in a trice. Question: is there any way to conditionally exclude files (this robots.txt file, in particular) from a rewrite?
– Parapluie
Feb 7 at 0:34
Wishing I could upvote this twice!
– Parapluie
Feb 7 at 0:34
@Parapluie Possibly with a rule that allows robots.txt before the one you have there. I think the web server goes through the rules sequentially and acts on the first match, so if it matches robots.txt it will act on that line. Examples here: serverfault.com/questions/213422/…
– Serge Rivest
Feb 11 at 22:09
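A minimal sketch of that ordering, assuming the rules live in the same .htaccess (the leading passthrough rule and the [L] flag are the key points):

# Hypothetical sketch: serve robots.txt as-is, then apply the existing catch-all
RewriteEngine On
RewriteRule ^robots\.txt$ - [L]
RewriteRule \.(gif|jpg|js|txt)$ https://www.thesiteinquestion.com/index.php [L]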
To add a bit of information: the web provider is not forced to respect the robots.txt standard at all, so it can do whatever it wants with the file and, as Serge said, it can be redirected anywhere.
answered Feb 6 at 22:18 by yagmoth555♦
The "web provider" is not forced to respect the standard? Am I misunderstanding?: Do you mean the crawler?
– Parapluie
Feb 7 at 0:35
@Parapluie I mean the hosting provider is not forced to follow the robots.txt standard, and thus crawlers must adapt to such cases.
– yagmoth555♦
Feb 7 at 0:37
That is interesting and germane. Thankfully, I have full access to the config in this case (even though my having access was the problem in the first place, at least I can fix it!) Thanks!
– Parapluie
Feb 7 at 0:40
A crawler should read robots.txt and follow its restrictions, but the web server cannot enforce this.
.htaccess (or the server config file) can be used to keep out crawlers that don’t comply, if you know who they are.
edited Feb 8 at 4:15, answered Feb 8 at 2:15 by WGroleau
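For example, a crawler that is known to ignore robots.txt could be turned away by its User-Agent string; a rough sketch in .htaccess (the bot name is a placeholder, and user agents can of course be spoofed):

# Hypothetical sketch: return 403 Forbidden to a known non-compliant crawler
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBotExample [NC]
RewriteRule ^ - [F,L]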
Yes, indeed. I am currently using a jail script to ban IPs who ignore the robots.txt directives. i.e.
– Parapluie
Feb 8 at 16:45