Common Crawl

Common Crawl is a non-profit organization that maintains a large, freely available index of the web, updated on a bi-monthly basis.

General

Official Site: https://commoncrawl.org/
Common Crawl Index Server: https://index.commoncrawl.org/
GitHub Repositories: https://github.com/commoncrawl - A few of the repositories are listed below, but there are many more.
- Common Crawl WARC Examples - "This repository contains both wrappers for processing WARC files in Hadoop MapReduce jobs and also Hadoop examples to get you started."
- Jupyter Notebooks to Analyze Common Crawl Data - Includes several different notebooks; some readers may be especially interested in running a notebook on AWS EMR.
- Common Crawl PySpark Examples - "This project provides examples [of] how to process the Common Crawl dataset with Apache Spark and Python".
- Common Crawl Index Server - "This project is a deployment of the pywb web archive replay and index server to provide an index query mechanism for datasets provided by Common Crawl".
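As a rough illustration of how the index server listed above can be queried, here is a minimal Python sketch that only builds a query URL and parses a response body. It assumes the server accepts pywb-style CDX parameters (`url`, `output=json`, `limit`) at a per-crawl endpoint; the crawl ID `CC-MAIN-2023-14` and the function names are illustrative choices, not part of any official client.

```python
import json
from urllib.parse import urlencode

# Assumption: per-crawl endpoints look like <server>/<crawl-id>-index
INDEX_SERVER = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str, limit: int = 5) -> str:
    """Build a CDX query URL for one crawl (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list:
    """Parse a newline-delimited JSON body into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Prints the URL only; fetching it is left to the HTTP client of your choice.
    print(build_index_query("CC-MAIN-2023-14", "example.com/*"))
```

The sketch deliberately stops short of making the request, so it can be tried without network access; the returned URL can be passed to any HTTP client.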

Tooling

cdx_toolkit - Stars: 127 - Updated: 3/2022 - Checked: 5/2023 - "a set of tools for working with CDX indices of web crawls and archives, including those at CommonCrawl and the Internet Archive's Wayback Machine."
rokasramas' fork of comcrawl - Stars: 0 - Updated: 4/2020 - Checked: 5/2023 - Includes a fix, not yet applied to the original comcrawl library, that allows it to work.
getallurls - Stars: 2.8k - Updated: 2/2023 - Checked: 5/2023 - Can fetch URLs from Common Crawl as well as Open Threat Exchange, the Wayback Machine, and URLScan.
CommonCrawlDocumentDownload - Stars: 50 - Updated: 4/2023 - Checked: 5/2023 - Downloads documents by file/mime type from CC.
WARCannon - Stars: 212 - Updated: 9/2022 - Checked: 5/2023 - Uses AWS to search Common Crawl data at scale with regex patterns.

Other

NewsFetch - Stars: 13 - Updated: 10/2022 - Checked: 5/2023 - Can fetch news articles from the Common Crawl API.
news-please - Stars: 1.6k - Updated: 4/2023 - Checked: 5/2023 - In addition to its substantial other functionality, it can fetch articles from Common Crawl.
PWA Store - Stars: 5 - Updated: 9/2022 - Checked: 5/2023 - Uses Common Crawl and EMR to find as many PWAs on the web as possible.

What Is?

C4 Dataset - Text data extracted from Common Crawl.
- https://github.com/shjwudp/c4-dataset-script
CDX - Capture/Crawl inDeX - Standard index format for WARCs.
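To make the role of a CDX entry concrete, the following sketch turns one JSON index record into the URL and HTTP `Range` header needed to fetch just that capture from its WARC file. It assumes the record carries `filename`, `offset`, and `length` fields and that files are served under `https://data.commoncrawl.org/`; the sample record is made up for illustration and does not point at a real capture.

```python
import json

# Assumption: WARC files are publicly served under this prefix.
DATA_PREFIX = "https://data.commoncrawl.org/"


def record_request(cdx_line: str) -> tuple:
    """Return (url, headers) for fetching the single record described by a CDX JSON line."""
    rec = json.loads(cdx_line)
    offset = int(rec["offset"])
    length = int(rec["length"])
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return DATA_PREFIX + rec["filename"], headers


if __name__ == "__main__":
    # Fabricated sample record, for illustration only.
    sample = json.dumps({
        "urlkey": "com,example)/",
        "url": "https://example.com/",
        "filename": "crawl-data/CC-MAIN-2023-14/segments/0.0/warc/sample-00000.warc.gz",
        "offset": "1000",
        "length": "500",
    })
    print(record_request(sample))
```

Because each record in a Common Crawl WARC file is compressed separately, the bytes returned for such a range can typically be decompressed on their own, which is what makes index-driven access cheaper than downloading whole files.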

Tutorials

General

Edward Ross. CommonCrawl Category. skeptric.
- Ross has published a number of well-written articles on Common Crawl. A great place to start if you are looking to go through the basics and beyond.
- Searching 100 Billion Webpages With Capture Index. 6/2020.
  - Explains how to query the index through the web interface (slow), with the CDX Toolkit, with comcrawl, and directly in Python without a dedicated Common Crawl library. Unfortunately, both comcrawl and the CDX Toolkit require some tweaks to get running.
- Read Common Crawl Parquet Metadata with Python. 4/2022.
  - Covers reading Parquet metadata using PyArrow, fastparquet, manually (in Python), and using asyncio to speed things up.
CommonCrawl.org So you're ready to get started.
- Covers a lot of ground, so it is perhaps not the best for true beginners. Covers data locations, file formats (WARC, WAT, WET), and indexes, as well as processing the files.
CommonCrawl.org Examples using Common Crawl Data.
- Unfortunately the vast majority of the examples available here are quite old.
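The WARC, WAT, and WET formats mentioned in the getting-started guide above are published as parallel files per crawl segment. As a small sketch, assuming the commonly seen layout in which a WARC path under `/warc/` ending in `.warc.gz` has siblings under `/wat/` and `/wet/` ending in `.warc.wat.gz` and `.warc.wet.gz`, the sibling path can be derived like this. The helper name and the sample path are hypothetical; check the path listings for the crawl you use before relying on this convention.

```python
def sibling_path(warc_path: str, kind: str) -> str:
    """Derive the WAT or WET path that corresponds to a WARC path (assumed naming convention)."""
    suffix = ".warc.gz"
    if kind not in ("wat", "wet"):
        raise ValueError("kind must be 'wat' or 'wet'")
    if "/warc/" not in warc_path or not warc_path.endswith(suffix):
        raise ValueError("not a recognised WARC path")
    # Split on the last '/warc/' directory component, then swap directory and suffix.
    head, name = warc_path.rsplit("/warc/", 1)
    stem = name[: -len(suffix)]
    return f"{head}/{kind}/{stem}.warc.{kind}.gz"


if __name__ == "__main__":
    # Made-up segment and file name, for illustration only.
    print(sibling_path("crawl-data/CC-MAIN-2023-14/segments/123.45/warc/file-00000.warc.gz", "wet"))
```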

AWS Athena

Sebastian Nagel. Index to WARC Files and URLs in Columnar Format. commoncrawl, 3/2018.
Stanislas Girard. Parse Petabytes of data from CommonCrawl in seconds. primates.dev, 1/2020.
- Simple, short, and fairly basic, but a good place to start.
Athul Jayson. Extracting Data from Common Crawl Dataset. qburst, 7/2020.
- Also has an associated GitHub repository.
Ryan Elkins. Search the html across 25 billion websites for passive reconnaissance using common crawl. 7/2020.
- While written from a security perspective, it provides solid guidance on using AWS Athena with Common Crawl. It also utilizes Amazon SageMaker, S3, and AWS IAM. There is an associated repo.
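The Athena tutorials above all revolve around running SQL against the columnar index. As a hedged sketch of what such a query might look like, the helper below assembles one as a Python string. The table name `"ccindex"."ccindex"` and the column names (`crawl`, `subset`, `url_host_registered_domain`, `warc_filename`, `warc_record_offset`, `warc_record_length`) are assumptions based on the schema described for the columnar index; verify them against the table you actually create. The function itself is hypothetical and does not call Athena.

```python
def build_ccindex_query(domain: str, crawl: str, limit: int = 10) -> str:
    """Assemble an Athena SQL query for one domain in one crawl (assumed schema)."""

    def quote(value: str) -> str:
        # Double any single quotes so the value is a safe SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = {quote(crawl)}\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_registered_domain = {quote(domain)}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_ccindex_query("example.com", "CC-MAIN-2023-14"))
```

Filtering on `crawl` and `subset` matters for cost: if those are partition columns, as assumed here, Athena only scans the matching partitions rather than the whole index.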

AWS EMR

Basil Latif. Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS. 6/2020.
Common Crawl EMR Tutorial - Stars: 9 - Updated: 3/2021 - Checked: 5/2023 - "This guide walks you through submitting a Scala Spark application to EMR that queries 500k job urls from Common Crawl and saves the results to an S3 bucket in CSV format."

AWS Lambda

Chris Madden, Aaron Bawcom. Analyzing Performance and Cost of Large-Scale Data Processing with AWS Lambda. 6/2019.
- Covers the high-level process and has an associated GitHub repository.
Jader Dias. One-click to download all the web pages you may want. 6/2022.
- Builds on using Athena to get data from Common Crawl and AWS Lambda to download it.

Snowflake

Venkat Sekar. Querying TB sized External Tables with Snowflake. 2/1/2022.

Basic

David Mackey. Basic Information About CommonCrawl. 5/2023.
David Mackey. How To Manually Access CommonCrawl. 5/2023.
Chillar Anand. Common Crawl On Laptop - Extracting Subset of Data. 11/2022.

Other

Colin Dellow. S3 Throughput: Scans vs Indexes. 2/2020.
- Is it faster to scan entire WARC files, or to use the index to pull just the required data from each WARC file?
