Since 2016, Common Crawl has regularly published the robots.txt files fetched during CCBot's web crawls. The robots.txt dumps are released alongside the regular WARC, WAT and WET files at intervals of roughly two to three months. We have parsed the last robots.txt dump of each year since 2016, resulting in eight years of collected statistics covering:
- File statistics: average content length (file size) and average number of lines and user agents per file (a counting sketch follows this list)
- Top user agents: the most frequently mentioned agent names
- User agent bias: the number of disallow-all rules ("Disallow: /") per user agent
- Resources: a dataset of extracted links to valid robots.txt files and sitemaps
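
As an illustration only, the following Python sketch shows how such per-file numbers could be collected from a single robots.txt WARC dump file using the warcio library. The local file name, the simplified line-based parsing, and the grouping of rules under preceding User-agent lines are assumptions made for the example; this is not the code behind the published statistics.

```python
from collections import Counter

from warcio.archiveiterator import ArchiveIterator  # pip install warcio


def robots_counts(body):
    """Count lines, mentioned user agents and disallow-all rules in one file."""
    agents, disallow_all = Counter(), Counter()
    current_agents, reading_agents = [], False
    lines = body.decode("utf-8", errors="replace").splitlines()
    for raw in lines:
        line = raw.split("#", 1)[0].strip()        # drop comments
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not reading_agents:                 # a new user-agent group starts
                current_agents, reading_agents = [], True
            agents[value.lower()] += 1
            current_agents.append(value.lower())
        elif field:
            reading_agents = False                 # a rule line ends the group header
            if field == "disallow" and value == "/":
                for agent in current_agents:       # "Disallow: /" = disallow all
                    disallow_all[agent] += 1
    return len(body), len(lines), agents, disallow_all


def scan_dump(warc_path):
    """Yield (URI, body) for successfully fetched robots.txt responses."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if record.http_headers.get_statuscode() != "200":
                continue                           # skip 404s and other failures
            uri = record.rec_headers.get_header("WARC-Target-URI")
            yield uri, record.content_stream().read()


if __name__ == "__main__":
    # Hypothetical local file name; the real dump paths are listed in each
    # crawl's robotstxt.paths.gz file.
    for uri, body in scan_dump("robotstxt.warc.gz"):
        size, n_lines, agents, disallow_all = robots_counts(body)
        print(uri, size, n_lines, sum(agents.values()), dict(disallow_all))
```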
The following table outlines each year together with the period in which the robots.txt files were fetched (capture time). Since the dumps also contain unsuccessful fetches (e.g. HTTP status code 404) and unparsable files, the table also lists the total number of successfully parsed robots.txt files and gives an estimate of the robots.txt adoption rate among websites (more precisely, hosts).
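
To make the estimate concrete, it can be read roughly as the share of fetched hosts whose robots.txt was successfully parsed; the numbers in the sketch below are invented and serve only to show the ratio, not actual crawl figures.

```python
# Invented counts, for illustration only (not values from the table).
hosts_with_fetch_attempt = 50_000_000  # hosts whose robots.txt was requested
successfully_parsed = 35_000_000       # fetches that yielded a parsable robots.txt

adoption_rate = successfully_parsed / hosts_with_fetch_attempt
print(f"estimated robots.txt adoption: {adoption_rate:.0%}")  # -> 70%
```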