Googlebot: Google's Web Crawler
Googlebot is Google's web-crawling robot. It collects documents from the web to build a searchable index for the Google search engine. On this page, you'll find answers to the most commonly asked questions about how our web crawler works.
1. How often will Googlebot access my web pages?
For most sites, Googlebot shouldn't access your site more than once every few seconds on average. However, due to network delays, it's possible that the rate will appear to be slightly higher over short periods.
2. How do I request that Google not crawl parts or all of my site?
robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server. The format of the robots.txt file is specified in the Robot Exclusion Standard. For detailed instructions about how to prevent Googlebot from crawling all or part of your site, please refer to our Removals page. Remember, changes to your server's robots.txt file won't be immediately reflected in Google; they'll be discovered and used when Googlebot next crawls your site.
3. Googlebot is crawling my site too fast. What can I do?
Please contact us with the URL of your site and a detailed description of the problem, and include a portion of the weblog that shows Google's accesses so we can track down the problem quickly.
4. Why is Googlebot asking for a file called robots.txt that isn't on my server?
robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server. For information on how to create a robots.txt file, see The Robot Exclusion Standard. If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named robots.txt.
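To see how a well-behaved crawler consumes a robots.txt file, here is a minimal sketch using Python's standard-library robots.txt parser. The rules and URLs are illustrative, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block everything under /private/ for all robots.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler identifying as "Googlebot" checks each URL before fetching it.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))  # True
print(parser.can_fetch("Googlebot", "http://www.example.com/private/x"))   # False
```

An empty robots.txt parsed the same way allows everything, which is why an empty file is enough to silence the "file not found" log entries.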
5. Why is Googlebot trying to download incorrect links from my server? Or from a server that doesn't exist?
It's a given that many links on the web will be broken or outdated at any particular time. Whenever someone publishes an incorrect link to your site (perhaps due to a typo or spelling error) or fails to update links to reflect changes in your server, Googlebot will try to download an incorrect link from your site. This also explains why you may get hits on a machine that's not even a web server.
6. Why is Googlebot downloading information from our "secret" web server?
It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it.
7. Why isn't Googlebot obeying my robots.txt file?
To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
We always suggest verifying that your syntax is correct against the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
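The longest-match behavior described above can be sketched in a few lines. This is illustrative code mirroring the /cgi-bin example, not Googlebot's actual implementation:

```python
# Rule set matching the example: allow everything, disallow /cgi-bin.
# Under a first-applicable-rule reading, "Allow: /" would match every URL
# first; under longest-match, the most specific matching prefix wins.
rules = [("allow", "/"), ("disallow", "/cgi-bin")]

def is_allowed(path: str) -> bool:
    # Keep only rules whose prefix matches the path, then pick the longest.
    matching = [(kind, prefix) for kind, prefix in rules if path.startswith(prefix)]
    kind, _ = max(matching, key=lambda rule: len(rule[1]))
    return kind == "allow"

print(is_allowed("/index.html"))     # True: only "/" matches
print(is_allowed("/cgi-bin/query"))  # False: "/cgi-bin" is longer than "/"
```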
8. Why are there hits from multiple machines at Google.com, all with user-agent Googlebot?
Googlebot was designed to be distributed on several machines to improve performance and scale as the web grows. Also, to cut down on bandwidth usage, we run many crawlers on machines located near the sites they're indexing in the network.
9. Can you tell me the IP addresses from which Googlebot crawls so that I can filter my logs?
The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
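Filtering by user-agent rather than IP address is straightforward in any log-processing script. A small sketch, using made-up log lines in a common combined-log shape (real formats vary):

```python
# Hypothetical web-server log lines; the user-agent is the last quoted field.
log_lines = [
    '1.2.3.4 - - [10/Mar/2006] "GET / HTTP/1.1" 200 512 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '5.6.7.8 - - [10/Mar/2006] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Keep only the hits whose user-agent string mentions Googlebot.
googlebot_hits = [line for line in log_lines if "Googlebot" in line]
print(len(googlebot_hits))  # 1
```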
10. Why is Googlebot downloading the same page on my site multiple times?
In general, Googlebot should only download one copy of each file from your site during a given crawl. Very occasionally the crawler is stopped and restarted, which may cause it to recrawl pages that it's recently retrieved.
11. Why don't the pages of my site that Googlebot crawled show up in your index?
Don't be alarmed if you can't immediately find documents that Googlebot has crawled in the Google search engine. Documents are entered into our index soon after being crawled. Occasionally, documents fetched by Googlebot won't be included for various reasons (e.g. they appear to be duplicates of other pages on the web).
12. What kinds of links does Googlebot follow?
Googlebot follows HREF links and SRC links.
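In practice, "following HREF and SRC links" means collecting the href and src attribute values from a fetched page. A minimal sketch with Python's standard-library HTML parser, using a made-up page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href and src attribute value from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

# Illustrative HTML: one HREF link and one SRC link.
page = '<a href="/about.html">About</a><img src="/logo.gif">'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/about.html', '/logo.gif']
```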
13. How do I prevent Googlebot from following links on my pages?
To keep Googlebot from following links on your pages to other pages or documents, you'd place the following meta tag in the head of your HTML document:
<META NAME="Googlebot" CONTENT="nofollow">
To learn more about meta tags, please refer to http://www.robotstxt.org/wc/exclusion.html#meta; you can also read what the HTML standard has to say about these tags. Remember, changes to your site won't be immediately reflected in Google; they'll be discovered and propagate when Googlebot next crawls your site.
14. How do I tell Googlebot not to crawl a single outgoing link on a page?
Meta tags can exclude all outgoing links on a page, but you can also instruct Googlebot not to crawl individual links by adding rel="nofollow" to a hyperlink. When Google sees the attribute rel="nofollow" on hyperlinks, those links won't get any credit when we rank websites in our search results. For example, a link
<a href="http://www.example.com/">This is a great link!</a>
could be replaced with
<a href="http://www.example.com/" rel="nofollow">I can't vouch for this link</a>.
15. What is Feedfetcher, and why is it ignoring my robots.txt file?
Feedfetcher requests come from explicit action by human users. When users add your feed to their Google homepage or to Google Reader, Google's Feedfetcher attempts to obtain the content of the feed in order to display it. Since all requests come from humans, Feedfetcher has been designed to ignore robots.txt.
16. How do I add my feed to the search results for Google's personalized homepage and Google Reader?
The feeds that Googlebot crawls appear in the search results for Google's personalized homepage and Google Reader. To ensure that your feed is part of this index, add a <link> tag to the header of your webpage to enable feed autodiscovery. There are many variations on <link> tags for this purpose, but below are a couple of simple examples.
* For an Atom feed:
<link rel="alternate" type="application/atom+xml" title="Your Feed Title" href="http://www.example.com/atom.xml" />
* For an RSS feed:
<link rel="alternate" type="application/rss+xml" title="Your Feed Title" href="http://www.example.com/rss.xml" />
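To verify that a page advertises its feed correctly, you can check the <head> for an autodiscovery tag. A small sketch using Python's standard-library HTML parser; the page markup is illustrative:

```python
from html.parser import HTMLParser

# MIME types used by the Atom and RSS autodiscovery examples above.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}

class FeedFinder(HTMLParser):
    """Collect hrefs from <link rel="alternate"> tags with a feed MIME type."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            self.feeds.append(a.get("href"))

page = ('<head><link rel="alternate" type="application/atom+xml" '
        'title="Your Feed Title" href="http://www.example.com/atom.xml" /></head>')
finder = FeedFinder()
finder.feed(page)
print(finder.feeds)  # ['http://www.example.com/atom.xml']
```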
Source - www.google.com