layout | table_content_length_all | table_content_length_curlie | table_num_lines_all | table_num_lines_curlie | table_num_user_agents_all | table_num_user_agents_curlie |
---|---|---|---|---|---|---|
basic | tables/basic-content-length-all.html | tables/basic-content-length-curlie.html | tables/basic-num-lines-all.html | tables/basic-num-lines-curlie.html | tables/basic-num-user-agents-all.html | tables/basic-num-user-agents-curlie.html |
The following tables present basic statistics on the collected robots.txt files and their development over the years. The statistics are additionally aggregated over 16 website categories. To categorize a website, we use its Curlie top-level label (example: https://cnn.com/robots.txt -> News). Note that the human-curated, filtered Curlie directory contains fewer than one million hosts, so most robots.txt files remain unlabeled.
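As a rough illustration of this labeling step, the sketch below maps a robots.txt URL to a Curlie top-level category. The file name `curlie_labels.tsv` and its host-to-category-path format are assumptions for the example, not part of the dataset.

```python
# Hypothetical sketch: assign each robots.txt host its Curlie top-level label.
# Assumes a TSV file with lines "<host>\t<category path>", e.g.
# "cnn.com\tNews/Media/Journalism"; file name and format are illustrative only.
from urllib.parse import urlparse

def load_curlie_top_labels(path="curlie_labels.tsv"):
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            host, category = line.rstrip("\n").split("\t", 1)
            # Keep only the top-level label, e.g. "News/Media/Journalism" -> "News".
            labels[host] = category.split("/", 1)[0]
    return labels

def label_robots_url(url, labels):
    # "https://cnn.com/robots.txt" -> "News", or None if the host is not in Curlie.
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return labels.get(host)
```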
We furthermore looked at the distribution of file sizes and the number of lines in the robots.txt files, which yields an interesting insight: some robots.txt templates shipped by content management systems, such as wix.com or WordPress, are so popular that they produce pronounced peaks in the length distribution.
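A minimal sketch of how such distributions can be computed is shown below; template-driven peaks would appear as spikes at specific sizes or line counts. The directory name `robots/` and the one-file-per-host layout are assumptions for the example.

```python
# Sketch: histogram of content length (bytes) and line count per robots.txt file.
# Assumes each collected robots.txt is stored as a .txt file under robots/.
from collections import Counter
from pathlib import Path

def length_distributions(directory="robots"):
    size_hist, line_hist = Counter(), Counter()
    for path in Path(directory).glob("*.txt"):
        data = path.read_bytes()
        size_hist[len(data)] += 1              # content length in bytes
        line_hist[len(data.splitlines())] += 1  # number of lines
    return size_hist, line_hist
```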