Open Source Web Crawlers

Comments

This page focuses on web crawlers/spiders as opposed to web scrapers. While there can be significant overlap between the two, our goal is to evaluate systems that are meant for web-scale crawling.
This document focuses on general purpose web crawlers. There is a growing niche of crawlers created specifically for security purposes which are not covered here.
We focus primarily on projects which are being actively developed. Projects which are showing limited signs of life may not be included. If you feel we've passed over a project that should be included, please create an issue or pull request.

General Resources

Awesome Crawler - Stars: 5.5k - Updated: 12/2022 - Checked: 4/2023.

Apache Nutch

https://nutch.apache.org/
GitHub Repo
- Stars: 2.6k - Updated: 3/2023 - Checked: 4/2023.
Probably the best known and most utilized open source web crawler.
Nutch Tutorial - The official tutorial for getting started with Nutch.

StormCrawler

http://stormcrawler.net/index.html
GitHub Repo
- Stars: 795 - Updated: 4/2023 - Checked: 4/2023.
Open source web crawler built on Apache Storm.
OpenWebSearch.eu's Owler web crawler is built off of StormCrawler.

Scrapy

https://scrapy.org/
GitHub Repo
- Stars: 9.9k - Updated: 4/2023 - Checked: 4/2023.
A popular, open source web crawler/scraper written in Python.
Scrapy Documentation on Broad Crawls.
WebScraping API's Web Crawling With Python. 12/2022.

Norconex Web Crawler

https://opensource.norconex.com/crawlers/web/
GitHub Repo
- Stars: 153 - Updated: 2/2023 - Checked: 4/2023.
Open source Java web crawler.

PulsarR

https://github.com/platonai/pulsarr
Open source web crawler written in Kotlin.

Heritrix

https://heritrix.readthedocs.io/en/latest/
GitHub Repo
- Stars: 2.4k - Updated: 3/2023 - Checked: 4/2023.
Open source web crawler written in Java by the Internet Archive.
See also Internet Archive's browser-based distributed crawler, brozzler.

Sparkler

http://irds.usc.edu/sparkler/
GitHub Repo
- Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
A next-generation successor to Apache Nutch that uses Spark, Kafka, Lucene/Solr, Tika, and pf4j.

CoCrawler

GitHub Repo
- Stars: 166 - Updated: 4/2022 - Checked: 4/2023.
Written in Python by Greg Lindahl (Blekko); currently pre-release.
- Included primarily because Lindahl has a proven track record in web crawling.

Comparisons

Rody. Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Pros & Cons. outsourceit.today, 10/2022.
- Covers Scrapy, Heritrix, Nutch, and PySpider.

Other

Crawlab - Stars: 9.7k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, distributed web crawler admin platform that works with multiple languages and frameworks including Scrapy.
- NOTE: Does not appear to have integrations with most web-scale crawlers, e.g. Nutch or StormCrawler.

Maybe...?

This section includes a few crawlers that are in development and show some promise.
Crawler - Stars: 233 - Updated: 3/2023 - Checked: 4/2023.
- Including this one because it's written in PHP, which isn't particularly common for web crawlers.
SeimiCrawler - Stars: 1.9k - Updated: 4/2023 - Checked: 4/2023.
- A Java-based, distributed, open source web crawler.
XXL-CRAWLER - Stars: 654 - Updated: 10/2022 - Checked: 4/2023.
- A Java-based, distributed, open source web crawler.
Sparkler-Crawler - Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
- A Java/Scala based web crawler built on Spark.
crawler - Stars: 22 - Updated: 4/2023 - Checked: 4/2023.
- An open source web crawler written in Rust that claims it is "capable of handling millions of pages per second efficiently."
colly - Stars: 19.3k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, open source framework for building crawlers/scrapers/spiders.
Montferret - Stars: 5.3k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, open source web scraper. Letting it slide in for its interesting declarative approach.
