Google Kicks Off July by Finally Standardizing Robots.txt After 25 Years

Google started off July with several announcements related to the Robots Exclusion Protocol (commonly referred to as REP, or colloquially as robots.txt after the original means of implementing it). Below, we briefly summarize what the protocol is and walk through each of Google’s announcements.

A couple of years ago, we wrote in detail about the Robots Exclusion Protocol (REP), which defines a way to specify which parts of your website automated scripts (called robots, spiders, or crawlers depending on the context) should not access. The protocol originated in 1994, when Martijn Koster drafted a document outlining how to use a plain text file (named robots.txt) to tell web crawlers which parts of a website they should not access. The original website later also introduced the concept of using meta tags to accomplish the same thing.

Standardizing the Protocol

While search engines have long adopted this protocol, neither the Internet Engineering Task Force (IETF) nor the World Wide Web Consortium (W3C) has ever formally adopted it as an internet standard. This left the rather vague protocol open to interpretation over the years, and not every search engine interpreted it the same way (or even followed it at all in some cases). In addition, while Koster’s original document has remained largely unchanged for 25 years, the internet has changed dramatically in that same period.

Against this backdrop, Google apparently decided the time was right to standardize the Robots Exclusion Protocol so that it can be implemented consistently going forward. On Monday, July 1, in partnership with Koster (the original author of the protocol), Google announced that it had submitted a working draft to the IETF to standardize and extend the Robots Exclusion Protocol with enhancements for the modern web. Here are a few of the proposed enhancements:

Defining how and when crawlers can cache the robots.txt file.
Allowing crawlers to define a maximum size for the file, while requiring that the limit be at least 500 KiB.
Officially adopting the “Allow:” directive that Google and others have followed for years despite not being part of the original specification.
Indicating how wildcards and flags should be used, and indicating how to deal with encoded characters.
Indicating that robots.txt can be used on any protocol based on the concept of a Uniform Resource Identifier (URI), rather than working only on HTTP.

This particular working draft refers only to robots.txt; it does not mention meta tags or other methods of robots exclusion. Compared with the original specification, the only new directive to be standardized is Allow:, which joins the User-agent: and Disallow: directives from the original document. Many search engine spiders, including Google’s own Googlebot, have long supported Allow:, which acts as the inverse of Disallow: by letting you indicate URLs that crawlers may specifically access. This is useful when you want to block a pattern of URLs while still allowing a specific URL that would otherwise match the pattern.
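To make the interaction between Allow: and Disallow: concrete, here is a minimal Python sketch of how a crawler might pick between competing rules. It assumes plain prefix matching with no wildcards, that the longest (most specific) matching rule wins, and that a tie goes to Allow:. The function name is_allowed, the RULES list, and the example paths are ours for illustration; this is not Google’s parser.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for a single
    user-agent group, e.g. ("Disallow", "/archive/").
    Assumption: prefix matching only, longest match wins, ties go to Allow.
    """
    best_length = -1
    allowed = True  # a path that matches no rule is allowed by default
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow_rule = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_favoring_allow = len(prefix) == best_length and is_allow_rule
        if longer or tie_favoring_allow:
            best_length = len(prefix)
            allowed = is_allow_rule
    return allowed


# Hypothetical rules: block an entire directory but keep one page crawlable.
RULES = [
    ("Disallow", "/archive/"),
    ("Allow", "/archive/index.html"),
]

print(is_allowed("/archive/index.html", RULES))      # True
print(is_allowed("/archive/2019/post.html", RULES))  # False
print(is_allowed("/about.html", RULES))              # True
```

The Allow: rule wins for /archive/index.html only because its path is longer than the Disallow: path that also matches; every other URL under /archive/ remains blocked.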

While the draft doesn’t standardize any of the other directives commonly included in robots.txt files, it does provide an extension mechanism through which individual crawlers can support additional directives that are not part of the standard. The draft cites the Sitemap: directive, defined by the XML Sitemaps protocol, as an example.
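As an illustration of a parser that handles both the core directives and the Sitemap: extension, the sketch below uses Python’s standard urllib.robotparser module (the site_maps() method requires Python 3.8 or later). The robots.txt content, the example.com URLs, and the ExampleBot user agent are made up for this example, and Python’s parser is an independent implementation that may not match the draft or Googlebot in every detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap: lines are collected separately from the access rules.
sitemaps = parser.site_maps()
print(sitemaps)

# The core Disallow: rule still governs which URLs may be fetched.
print(parser.can_fetch("ExampleBot", "https://www.example.com/private/page.html"))
print(parser.can_fetch("ExampleBot", "https://www.example.com/about.html"))
```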

The new draft standard is in its early stage at the IETF, and Google stated in a blog post that it welcomes comments as the draft is revised.

Open Sourcing the Google robots.txt Parser

Once the protocol is standardized, it will be important that web crawlers implement the standard reliably. This will mean work for every company that makes a spider that crawls the web. To ease this transition, on the same day that Google announced it was spearheading the effort to standardize the protocol, it also announced that it is open sourcing its own robots.txt parser, which it has used for over 20 years.

The library is written in C++, and according to the blog post, some pieces of the code are over 20 years old (though other pieces are much newer). The parser is published for free on GitHub under the Apache 2.0 License, a very permissive license that allows both commercial and non-commercial use as well as modification by any party.

Google has opened up the repo to pull requests as well, and third parties have already started submitting a number of changes to the parser. It is unclear from Google’s announcement whether changes made by third parties to the open source project will in turn be adopted by Google’s own crawler, Googlebot. If they are, keeping track of this open source project may be a good way of knowing just how Google treats the robots.txt file.

Cleaning up the Googlebot Parser’s Use of Unsupported Directives

In the wake of standardizing the protocol and open sourcing its own parser, Google has announced that Googlebot will stop honoring robots.txt rules that were never officially supported. In particular, Google has indicated that it is retiring the Crawl-delay:, Nofollow:, and Noindex: directives from Googlebot’s robots.txt parser effective September 1.

Keep in mind that this applies specifically to the robots.txt parser; the noindex and nofollow directives in meta tags and HTTP headers are still supported. Google never documented its support for Noindex: in robots.txt files, as we mentioned in our 2017 post, but until now it was an effective means of instructing Googlebot not to index certain URLs.

Google made these changes to the parser before open sourcing it, and as such the open source code contains no references to the Crawl-delay:, Nofollow:, or Noindex: directives.
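If you want to check whether your own robots.txt file still relies on any of the three directives named above, a short script along the following lines may help. This is a rough sketch: the function name find_retired_directives and the sample file contents are hypothetical, and the check simply looks at the directive name before the first colon on each line.

```python
# The three directives named above, lowercased for case-insensitive matching.
RETIRED = {"crawl-delay", "nofollow", "noindex"}


def find_retired_directives(robots_txt):
    """Return (line_number, directive) pairs for retired directives."""
    findings = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Ignore anything after a '#', which starts a comment in robots.txt.
        content = line.split("#", 1)[0].strip()
        if ":" not in content:
            continue
        name = content.split(":", 1)[0].strip().lower()
        if name in RETIRED:
            findings.append((number, name))
    return findings


# Hypothetical robots.txt content for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /tmp/
Noindex: /drafts/
Crawl-delay: 10
"""

for number, name in find_retired_directives(SAMPLE):
    print(f"line {number}: {name}")
# line 3: noindex
# line 4: crawl-delay
```

Any rule the script flags would need to move to a mechanism that remains supported, such as the noindex and nofollow directives in meta tags or HTTP headers.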

Wrapping Up

Google’s Webmaster Central team was certainly busy the first week of July. In two days they published three blog posts, an IETF Working Draft, and an open source GitHub repository containing two decades of code. This standardization of the Robots Exclusion Protocol is clearly something Google feels very strongly about, which makes sense. Defining what a crawler can and cannot access on websites is fundamental to your site being indexed. We recommend that you or your webmaster take a look at your robots.txt file and make sure your directives follow the new draft standard, and keep an eye on the draft as it evolves. We here at Justia are following these developments closely, as always.