Common Crawl

Common Crawl is a non-profit organization that maintains a large, freely available index of the web that is updated on a bi-monthly basis.

General

Tooling

  • cdx_toolkit - Stars: 127 - Updated: 3/2022 - Checked: 5/2023 - "a set of tools for working with CDX indices of web crawls and archives, including those at CommonCrawl and the Internet Archive's Wayback Machine."
  • rokasramas' fork of comcrawl - Stars: 0 - Updated: 4/2020 - Checked: 5/2023 - Includes a fix, not yet applied to the original comcrawl library, that allows it to work.
  • getallurls - Stars: 2.8k - Updated: 2/2023 - Checked: 5/2023 - Can fetch URLs from Common Crawl as well as Open Threat Exchange, the Wayback Machine, and URLScan.
  • CommonCrawlDocumentDownload - Stars: 50 - Updated: 4/2023 - Checked: 5/2023 - Downloads documents by file/MIME type from Common Crawl.
  • WARCannon - Stars: 212 - Updated: 9/2022 - Checked: 5/2023 - Uses AWS to search Common Crawl data at scale with regex patterns.
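
Several of the tools above work against Common Crawl's CDX index. As a rough sketch of what such a lookup looks like, the snippet below only builds the query URL and does not perform the request. The host name, the `<crawl id>-index` endpoint layout, the `CC-MAIN-2023-14` crawl ID, and the helper name `build_index_query` are assumptions made for illustration, not something taken from the tools listed here.

```python
from urllib.parse import urlencode

# Assumption: the public index server exposes one endpoint per crawl,
# named "<crawl id>-index", under this host.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str, limit: int = 5) -> str:
    """Return a CDX index query URL (hypothetical helper, no network access)."""
    params = {
        "url": url_pattern,   # URL or wildcard pattern to look up
        "output": "json",     # ask for one JSON record per line
        "limit": limit,       # cap the number of records returned
    }
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


# "CC-MAIN-2023-14" is used here purely as an illustrative crawl ID.
query = build_index_query("CC-MAIN-2023-14", "example.com/*")
print(query)
```

Libraries such as cdx_toolkit wrap this kind of query, along with paging and retries, so hand-built URLs are mainly useful for quick experiments.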

Other

  • NewsFetch - Stars: 13 - Updated: 10/2022 - Checked: 5/2023 - Can fetch news articles from the Common Crawl API.
  • news-please - Stars: 1.6k - Updated: 4/2023 - Checked: 5/2023 - Among significant other functionality, it can fetch articles from Common Crawl.
  • PWA Store - Stars: 5 - Updated: 9/2022 - Checked: 5/2023 - Uses Common Crawl and EMR to find as many PWAs on the web as possible.

What Is?

Tutorials

General

AWS Athena

AWS EMR

AWS Lambda

Snowflake

Basic

Other

  • Colin Dellow. S3 Throughput: Scans vs Indexes. 2/2020.
    • Is it faster to scan entire WARC files, or to use the index to pull just the required data from each WARC file?
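
To make the index-based option in that question concrete, here is a minimal sketch of turning one index record into an HTTP range request for a single WARC record. It assumes the record is a JSON line with `filename`, `offset`, and `length` fields and that WARC files are served from the host shown. The sample record and the helper name `range_request_for` are hypothetical, and no request is actually sent.

```python
import json

# Assumption: WARC files are served over HTTPS from this host.
DATA_HOST = "https://data.commoncrawl.org"


def range_request_for(record_line: str) -> tuple[str, dict[str, str]]:
    """Map one JSON index record to (url, headers) for a byte-range fetch."""
    record = json.loads(record_line)
    offset = int(record["offset"])   # byte position where the record starts
    length = int(record["length"])   # compressed size of the record in bytes
    url = f"{DATA_HOST}/{record['filename']}"
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Hypothetical index record, made up for illustration only.
sample = json.dumps({
    "filename": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    "offset": "1000",
    "length": "500",
})
url, headers = range_request_for(sample)
print(url, headers)
```

The trade-off the article examines follows from this shape: an indexed lookup issues one small request per record, while a scan reads each WARC file sequentially in full.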