---
layout: basic
table_content_length_all: tables/basic-content-length-all.html
table_content_length_curlie: tables/basic-content-length-curlie.html
table_num_lines_all: tables/basic-num-lines-all.html
table_num_lines_curlie: tables/basic-num-lines-curlie.html
table_num_user_agents_all: tables/basic-num-user-agents-all.html
table_num_user_agents_curlie: tables/basic-num-user-agents-curlie.html
---

# File statistics

The following tables present basic statistics on the collected robots.txt files and their development over the years. The statistics are additionally aggregated over 16 website categories. To categorize the websites, we use the Curlie top-level label (example: https://cnn.com/robots.txt -> News). Note that the human-curated, filtered Curlie directory links to fewer than one million hosts, so most robots.txt files remain unlabeled.
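
The labelling step essentially boils down to a host-to-category lookup. The following is a minimal sketch of that idea, assuming the Curlie dump has already been flattened into a tab-separated `host<TAB>top-level-label` file; the file name, format, and helper names are illustrative and not the exact pipeline behind the tables above.

```python
from urllib.parse import urlparse


def load_curlie_labels(path: str) -> dict[str, str]:
    """Read host -> Curlie top-level label pairs from a tab-separated file."""
    labels: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            host, label = line.rstrip("\n").split("\t", 1)
            labels[host.lower()] = label
    return labels


def categorize(robots_url: str, labels: dict[str, str]) -> str:
    """Map a robots.txt URL to its Curlie top-level category, if known."""
    host = (urlparse(robots_url).hostname or "").lower()
    host = host.removeprefix("www.")
    return labels.get(host, "unlabeled")


labels = load_curlie_labels("curlie_hosts.tsv")           # hypothetical input file
print(categorize("https://cnn.com/robots.txt", labels))   # e.g. "News"
```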

*Figure: plot of bias among user agents.*

We also looked at the distribution of file sizes and of the number of lines in the robots.txt files, which yields an interesting insight: the robots.txt templates shipped by some Content Management Systems, such as wix.com or WordPress, are so popular that they produce distinct peaks in the length distribution.
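
Such peaks are easy to spot by counting how many files share exactly the same byte size or line count. Below is a minimal sketch of that computation, assuming the collected robots.txt files sit in a local directory of plain-text files; the directory name and file layout are illustrative, not the exact analysis code behind the tables above.

```python
from collections import Counter
from pathlib import Path


def length_distributions(directory: str) -> tuple[Counter, Counter]:
    """Count how many robots.txt files share each byte size and line count."""
    size_counts: Counter = Counter()
    line_counts: Counter = Counter()
    for path in Path(directory).glob("*.txt"):   # illustrative storage layout
        data = path.read_bytes()
        size_counts[len(data)] += 1
        line_counts[data.count(b"\n") + 1] += 1
    return size_counts, line_counts


sizes, lines = length_distributions("robots_txt_dump")   # hypothetical directory
# Popular CMS templates (e.g. wix.com or WordPress) appear as sharp peaks:
for size, count in sizes.most_common(5):
    print(f"{count} files are exactly {size} bytes long")
```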