A web robot, commonly referred to as a bot, is any non-human web user. Bots are usually scripts or programs that access web pages for a variety of reasons. The most familiar bot is Googlebot, which accesses and analyses web pages for inclusion in Google’s search index. Most site owners want Googlebot to come calling, but there are also many bots with less friendly intent, including those that visit a site and attempt to hack it or scrape its content for use on other sites.
In fact, a significant majority of web traffic is generated by bots rather than humans. Towards the end of 2013, Incapsula estimated that about 61.5 percent of all web traffic was bots.
The Robots Exclusion Protocol, also known as the robots.txt protocol, is a way for site owners to communicate with bots. Specifically, it is a way to tell bots which files and directories on a site they are allowed to access, and even whether they are allowed to access the site at all.
It’s important to note that abiding by the contents of a robots.txt file is entirely voluntary: malicious bots will ignore it, and good bots, including Googlebot and the crawlers of most search engines, will follow the directives.
The robots.txt file is placed in the root directory of a domain, and can contain a number of directives of the format:
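A minimal example uses the wildcard user-agent and an empty Disallow directive:

```
User-agent: *
Disallow:
```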
This is the simplest possible robots.txt file. In effect it does nothing: it allows all bots to access all files on the server, which is just as if there were no robots.txt at all.
The following does exactly the opposite, banning all compliant bots from looking at anything on the domain. Keep in mind that if your robots.txt says this, Googlebot will not crawl your site.
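The only change is a single slash, which matches everything on the site:

```
User-agent: *
Disallow: /
```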
As you can see, robots.txt specifies a user-agent, a string that matches the “name” of a bot such as Googlebot, followed by a list of “Disallow” instructions, each of which refers to a file or directory. This is useful because sometimes we want a bot to crawl some parts of our site, but not others. For example, a WordPress site might not want Googlebot to crawl its wp-admin directory, in which case its robots.txt file would include the following:
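One way to express that rule (a sketch; real WordPress configurations usually include more than this):

```
User-agent: Googlebot
Disallow: /wp-admin/
```

The trailing slash matters: it restricts the rule to the wp-admin directory and its contents, rather than any path that merely begins with “/wp-admin”.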
This is just an example; a full robots.txt contains many more instructions. If you want the full details on setting up a robots.txt for WordPress, take a look at this page from WordPress or, for an alternative perspective, this page from Yoast.
The Robots Exclusion Protocol contains more than just instructions for blocking bots; it also has directives for sitemap location, crawl frequency, and other, less-used purposes. The Wikipedia page contains a full list.
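For instance, a robots.txt might point crawlers at a sitemap and ask them to wait between requests (the URL below is a placeholder; note that Crawl-delay is a non-standard extension that some crawlers, including Googlebot, ignore):

```
User-agent: *
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
```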
Robots.txt is important for site owners because it can impact how their site performs in search engines, so if you’re building a site, understanding robots.txt should be a priority.
There is also a humans.txt
Humans.txt is not really analogous to robots.txt, but not many people know about it, so it’s worth a mention. Humans.txt is intended as a standard location for giving credit to the people and technology behind a site. It can include things like the name of the site owner and their social media details, a list of people that the site owner wants to thank, and the components of the site, such as jQuery, Modernizr, and so on. Google has a very simple humans.txt, and the Spanish web developer Abel Calbans has a fuller example.
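A short humans.txt, following the section style suggested by humanstxt.org, might look something like this (all names and dates here are placeholders):

```
/* TEAM */
Site owner: Jane Doe
Twitter: @janedoe
Location: London, UK

/* THANKS */
Everyone who reported bugs and sent feedback.

/* SITE */
Last update: 2014/01/01
Components: jQuery, Modernizr, HTML5 Boilerplate
```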