How does a company like Google index the Internet? It uses small programs called "bots" to sniff their way around the web. Like a young, adventurous pet, a bot can occasionally dig a little too deep and expose more web pages than the site owner intended. For instance, it might expose a special admin page or a temporary directory not meant for public viewing. In the mid-1990s, Martijn Koster invented the concept of a robots.txt file. This simple text file is placed at the root of a website and provides special instructions for bots. An example is provided below:
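As a minimal sketch, the snippet below embeds a small robots.txt file and shows how a rule-abiding bot would read it, using Python's standard `urllib.robotparser` module. The paths `/admin/` and `/tmp/`, the bot name `ExampleBot`, and the helper `is_allowed` are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot ("*") is asked to stay out of
# two directories. The paths are placeholders chosen for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a bot that honors robots.txt may fetch the path."""
    return parser.can_fetch(agent, path)


print(is_allowed("/index.html"))   # True
print(is_allowed("/admin/login"))  # False
```

Note that the file only states a request. Nothing in it technically prevents a bot from fetching the listed paths.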
The concept of this file is very simple: it lists the folders and files that bots should exclude from crawling. (For more information about how to build a robots file, visit robotstxt.org.) Over time, the major search engines were updated to abide by these rules. This was an excellent, simple way to keep unwanted results out of search engines. It did not require any programming or complex setup to implement.
Unfortunately, over the past year Google decided to partially ignore the robots.txt file in an attempt to increase the effectiveness of its results. The following is a direct quote from Google's webmaster documentation about how it interprets a robots.txt file:
This minor footnote means that web pages previously excluded from Google's search results may start showing up again. In this circumstance, the search results will display the title, URL, and the message "A description for this result is not available because of the site's robots.txt - learn more".
If Google no longer completely respects the robots.txt file, what other options are available? Acceptable alternatives include a robots meta tag or an X-Robots-Tag HTTP header. Both options accept the same directives and take the basic concept of a robots.txt file to the next level. The following is an example of a robots meta tag:
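As a rough sketch, a page template could generate the tag with a small Python helper. The function name `robots_meta_tag` is hypothetical, and the directives passed in are only examples. The resulting tag belongs inside the page's `<head>` element.

```python
from html import escape


def robots_meta_tag(*directives: str) -> str:
    """Build a robots meta tag from directives such as "noindex"."""
    content = ", ".join(directives)
    # Escape the value so it is always safe inside an HTML attribute.
    return f'<meta name="robots" content="{escape(content, quote=True)}">'


print(robots_meta_tag("noindex", "nofollow"))
# <meta name="robots" content="noindex, nofollow">
```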
A complete list of options is available on Google's webmaster site. It includes specialized directives such as "noimageindex", which prevents the indexing of images on a web page, and "notranslate", which disallows translation of a web page into other languages. Although this provides more flexibility for web administrators, it requires more time and effort: the tags or headers must be added by a programmer or through custom configuration on the web server. Once Google recrawls a page that carries a "noindex" directive, it drops the offending page from its search results completely. The page must not also be blocked in robots.txt, because a bot that never fetches the page never sees the directive.
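One way a programmer might attach the HTTP header is sketched below as a small Python WSGI application. The function name `app`, the page body, and the chosen directives are assumptions for illustration. On a production site the header would more likely be set in the web server's own configuration.

```python
def app(environ, start_response):
    """Minimal WSGI app that serves one page and asks bots not to index it."""
    body = b"<h1>Internal report</h1>"  # placeholder page content
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Same directives a robots meta tag would carry, sent as a header.
        ("X-Robots-Tag", "noindex, noimageindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local trial, the app can be passed to `wsgiref.simple_server.make_server` from the standard library. The header approach also works for non-HTML files such as PDFs or images, where a meta tag cannot be embedded.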