How does Google protect against scraping?
I'm trying to implement protection against scraping on my website to block basic scraping techniques.
Google seems to have very good protection against scraping, but it's so good that I can't work out how it does it.
I was trying to make an HTTP GET request as a "normal user", using normal browser headers and query parameters.
It worked fine for a certain number of requests, then Google displayed a 503 error page saying that unusual traffic had been detected; the page also showed my external IP address.
What's weird is that from my normal Chrome browser there were absolutely no errors when requesting that same URL, but my custom HTTP requests kept getting status 503.
I was almost certain that a proxy server could bypass such protection, but I was wrong: even though the error page displayed a different IP address, I kept receiving the 503 error.
Request information
Main
----
Method: GET
URL: https://www.google.com/search
Data (query parameters)
-----------------------
q: "this+is+example"
ie: utf-8
oe: utf-8
start: 0
Headers
-------
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'
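For reference, the scripted request described above boils down to something like the following. This is only a sketch, since the question does not show the actual script; it assumes Python with the requests library.
Example sketch (Python)
-----------------------
# Reproduces the request described above; purely illustrative.
import requests

params = {
    "q": "this is example",  # requests URL-encodes this (spaces become '+')
    "ie": "utf-8",
    "oe": "utf-8",
    "start": 0,
}
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/27.0.1453.116 Safari/537.36"
    ),
}

resp = requests.get("https://www.google.com/search", params=params, headers=headers)
print(resp.status_code)  # 200 at first, 503 once the traffic is flagged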
The headers sent from my browser were generated by Chrome; I was logged in, so session cookies were sent along with the request as well.
If it isn't IP rate limiting or cookie-based rate limiting, how could Google identify such a scraping bot? Is there any other method that can offer this kind of protection?
Tags: http, google, protection
asked Aug 13 at 12:43 by ShellRox, edited Aug 13 at 14:48
Interesting question, but it's pretty light on details to help us figure it out. Can you add details about how your script works? Maybe a side-by-side comparison of your script's messages and your browser's (including HTTP headers)? And when do you start getting the 503?
– Mike Ounsworth, Aug 13 at 12:53
Yeah, you seem to be discounting the fact that you are using a custom script to access the site, and that it is the script itself that is being detected.
– schroeder ♦, Aug 13 at 12:54
Are you signed into Google in the Chrome browser? That may be preventing the 503 error from appearing in Chrome, as it sees you as a potentially legitimate user.
– MoonRunestar, Aug 13 at 13:49
@MoonRunestar I actually thought of that as well; perhaps that is the reason. But even when I deleted the session cookie data, I could still access the search service.
– ShellRox, Aug 13 at 14:50
@schroeder Of course, but those are exactly the security features I'm trying to implement: the methods Google may be using to identify custom scripts.
– ShellRox, Aug 13 at 14:56
2 Answers
Accepted answer (score 6), answered Aug 13 at 14:33 by TripeHound
One "obvious" way that comes to mind (but I have no idea whether Google does this) is it is looking for related requests that a browser would generate after retrieving the main page.
A browser will retrieve the main URL and then (for a typical page) request several additional items: JavaScript files, images, CSS files etc.
Depending on how you're scripting the get (e.g. you only mention "make an HTTP GET request") if it sees repeated requests for "main pages", but no interleaved requests for .js/.css/.jpg files, then it might assume you are a script.
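As a rough illustration of this heuristic (and only that; nothing here is something Google has documented), a server could count how many page requests from a client arrive without any follow-up requests for the page's assets. The client key, asset extensions and threshold below are assumptions made for the example.
Example sketch (Python)
-----------------------
# "No asset requests" heuristic: block clients that keep fetching pages
# without ever loading the JavaScript/CSS/image sub-resources.
from collections import defaultdict

ASSET_EXTENSIONS = (".js", ".css", ".jpg", ".png", ".gif", ".ico")
SUSPICION_THRESHOLD = 5  # page-only requests tolerated before blocking

page_only_hits = defaultdict(int)

def record_request(client_ip: str, path: str) -> bool:
    """Return True to serve the request, False to answer with a 503."""
    if path.endswith(ASSET_EXTENSIONS):
        # The client is loading sub-resources like a real browser,
        # so reset its "pages without assets" counter.
        page_only_hits[client_ip] = 0
        return True
    page_only_hits[client_ip] += 1
    # Let the first few page-only requests through (people mashing F5),
    # then start refusing.
    return page_only_hits[client_ip] <= SUSPICION_THRESHOLD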
That seems like a very good method for identifying unusual traffic, thank you! But shouldn't it happen right away? Google can somehow block the request without depending on the IP address or cookie information; I wonder what that method really could be.
– ShellRox, Aug 13 at 14:54
If they do something like this, I guess they let a couple through so as not to exclude people hitting F5 repeatedly or other "normal" issues. After "a few" main-page-only requests, they could block.
– TripeHound, Aug 13 at 14:59
It's a little weird: identification is one process, but blacklisting the specific information associated with a user is a second one. An IP blacklist could easily be bypassed with proxy servers or, in some cases, by modifying host headers, and IP blacklisting might not be optimal for user friendliness. Yet Google's method can block such potentially malicious requests without disturbing the normal end user at all.
– ShellRox, Aug 13 at 15:15
Another factor it could be working off is the lack of "normally expected" headers in your scripted requests that a browser would normally include (e.g. looking at what Chrome sent for this page: accept, accept-encoding, accept-language, cache-control, cookie, dnt(?), referer, upgrade-insecure-requests).
– TripeHound, Aug 13 at 15:30
Yes, I believe that would be the most logical case; perhaps cookies and a few other headers have priority in such cases.
– ShellRox, Aug 13 at 15:47
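To illustrate TripeHound's point about missing headers, a crude server-side check could compare the incoming header set against what a typical browser sends. The expected header list and the cut-off below are illustrative assumptions; real browsers differ in detail.
Example sketch (Python)
-----------------------
# "Does this look like a browser?" header check, following the comment above.
EXPECTED_BROWSER_HEADERS = {
    "accept", "accept-encoding", "accept-language",
    "cookie", "referer", "upgrade-insecure-requests",
}

def looks_like_browser(headers: dict, minimum: int = 4) -> bool:
    present = {name.lower() for name in headers}
    return len(EXPECTED_BROWSER_HEADERS & present) >= minimum

# A script that only sets User-Agent fails the check:
print(looks_like_browser({"User-Agent": "Mozilla/5.0 ..."}))  # False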
Answer (score 1), answered Aug 13 at 18:22 by Joshua, edited Aug 13 at 19:05
If you try to bot it to any significant degree, you will get asked for a CAPTCHA. This was really annoying during the two weeks I was restricted to the w3m web browser: Google thought I was a bot.
A call or two will get through just fine, but if you try any serious volume, the CAPTCHA demands will increase. I've hit them from time to time by hand.
They not only monitor single IP addresses, but also Class C network ranges. We hit this in college occasionally: too many requests too fast from the same Class C can trigger it as well. I think this check is suppressed for properly logged-in clients, but they will notice if the same logged-in user is too active.
They also have deep characterization analysis that can identify users without their being logged in, which you have no hope of replicating. Google once claimed (in a statement I can't find now) that they had the ability to unmask private browsing but chose not to do so.
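A simplified version of the behaviour described above, with per-IP and per-"Class C" limits and a CAPTCHA escalation, might look like the sketch below. The window and thresholds are invented for illustration, and the Class C range is approximated as the IPv4 /24 prefix.
Example sketch (Python)
-----------------------
# Per-IP and per-/24 rate limiting with a CAPTCHA escalation step.
# Window size and limits are illustrative assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
PER_IP_LIMIT = 30        # searches per minute from one address
PER_SUBNET_LIMIT = 300   # searches per minute from one /24 range

ip_hits = defaultdict(deque)
subnet_hits = defaultdict(deque)

def _recent(log: deque, now: float) -> int:
    # Drop timestamps that fell out of the sliding window, return the count.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    return len(log)

def check_request(ip: str) -> str:
    """Return 'ok' to serve the search, or 'captcha' to challenge the client."""
    now = time.time()
    subnet = ip.rsplit(".", 1)[0]  # crude /24 prefix for an IPv4 address
    ip_hits[ip].append(now)
    subnet_hits[subnet].append(now)
    if (_recent(ip_hits[ip], now) > PER_IP_LIMIT
            or _recent(subnet_hits[subnet], now) > PER_SUBNET_LIMIT):
        return "captcha"           # challenge rather than hard-block
    return "ok"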
I did not downvote, by the way. It's just interesting that they can implement an algorithm that is so user-friendly and yet protected against bots.
– ShellRox, Aug 13 at 18:49
@ShellRox: I think they don't care so long as the bot acts like a human.
– Joshua, Aug 13 at 18:51
Of course, I could implement a similar system on my website that rate-limits IP addresses. But the problem is the second step, after identifying the bot: how is the blacklisting process done so efficiently? IP and cookie blacklists can easily be bypassed by a bot, yet they can somehow block bots with more effective methods.
– ShellRox, Aug 13 at 18:54