- Comments
- General Resources
- Apache Nutch
- StormCrawler
- Scrapy
- Norconex Web Crawler
- PulsarR
- Heritrix
- Sparkler
- CoCrawler
- Comparisons
- Other
- Maybe...?
- This page focuses on web crawlers/spiders as opposed to web scrapers. While there can be significant overlap between the two, our goal is to evaluate systems that are meant for web scale crawling.
- This document focuses on general purpose web crawlers. There is a growing niche of crawlers created specifically for security purposes which are not covered here.
- We focus primarily on projects which are being actively developed. Projects which are showing limited signs of life may not be included. If you feel we've passed over a project that should be included, please create an issue or pull request.
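
To make the crawler/scraper distinction above concrete, here is a minimal sketch of the part that makes something a crawler: a frontier of URLs that grows as links are discovered. The `FAKE_WEB` graph, the `fetch_links` helper, and the `crawl` function are hypothetical names used only for illustration; a real crawler would fetch pages over HTTP, respect robots.txt, and rate-limit per host, none of which is shown here.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/": ["https://c.example/", "https://a.example/about"],
    "https://c.example/": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its outgoing links."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: follow links outward from the seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    order = []                # visit order, returned for inspection
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl(["https://a.example/"], fetch_links)
print(visited)
```

A scraper, by contrast, typically starts from a known set of pages and concentrates on extracting fields from them rather than on discovering new URLs.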
- Awesome Crawler - Stars: 5.5k - Updated: 12/2022 - Checked: 4/2023.
- https://nutch.apache.org/
- GitHub Repo
- Stars: 2.6k - Updated: 3/2023 - Checked: 4/2023.
- Probably the best-known and most widely used open source web crawler.
- Nutch Tutorial - The official tutorial for getting started with Nutch.
- http://stormcrawler.net/index.html
- GitHub Repo
- Stars: 795 - Updated: 4/2023 - Checked: 4/2023.
- Open source web crawler built on Apache Storm.
- OpenWebSearch.eu's Owler web crawler is built on StormCrawler.
- https://scrapy.org/
- GitHub Repo
- Stars: 9.9k - Updated: 4/2023 - Checked: 4/2023.
- A popular, open source web crawler/scraper written in Python.
- Scrapy Documentation on Broad Crawls.
- WebScraping API's Web Crawling With Python. 12/2022.
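
As a rough illustration of what tuning Scrapy for a broad crawl can look like, the sketch below collects a few settings of the kind the Broad Crawls documentation discusses. The setting names are real Scrapy settings; the values and the `BROAD_CRAWL_SETTINGS` name are assumptions chosen for illustration and should be tuned against your own hardware and politeness requirements.

```python
# Illustrative overrides for a broad (many-domain) crawl.
# Values are starting points only, not recommendations for any specific setup.
BROAD_CRAWL_SETTINGS = {
    # Priority queue that balances requests across many domains.
    "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
    # Raise global concurrency well above the default.
    "CONCURRENT_REQUESTS": 100,
    # More threads for DNS resolution when hitting many hosts.
    "REACTOR_THREADPOOL_MAXSIZE": 20,
    # Less logging overhead than DEBUG.
    "LOG_LEVEL": "INFO",
    # Broad crawls rarely need sessions, so cookies can be disabled.
    "COOKIES_ENABLED": False,
    # Skip retries and redirects to keep throughput up.
    "RETRY_ENABLED": False,
    "REDIRECT_ENABLED": False,
    # Give up on slow sites sooner (seconds).
    "DOWNLOAD_TIMEOUT": 15,
}
```

In a Scrapy project these keys would normally go in `settings.py`, or in a spider's `custom_settings` attribute.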
- https://opensource.norconex.com/crawlers/web/
- GitHub Repo
- Stars: 153 - Updated: 2/2023 - Checked: 4/2023.
- Open source Java web crawler.
- https://github.com/platonai/pulsarr
- Open source web crawler written in Kotlin.
- https://heritrix.readthedocs.io/en/latest/
- GitHub Repo
- Stars: 2.4k - Updated: 3/2023 - Checked: 4/2023.
- Open source web crawler written in Java by the Internet Archive.
- See also Internet Archive's browser-based distributed crawler, brozzler.
- http://irds.usc.edu/sparkler/
- GitHub Repo
- Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
- A next-generation successor to Apache Nutch that uses Spark, Kafka, Lucene/Solr, Tika, and pf4j.
- GitHub Repo
- Stars: 166 - Updated: 4/2022 - Checked: 4/2023.
- Written in Python by Greg Lindahl (Blekko); currently pre-release.
- Included primarily because Lindahl has a proven track record in web crawling.
- Rody. Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Pros & Cons. outsourceit.today, 10/2022.
- Covers Scrapy, Heritrix, Nutch, and PySpider.
- Crawlab - Stars: 9.7k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, distributed web crawler admin platform that works with multiple languages and frameworks including Scrapy.
- NOTE: Does not appear to have integrations with most web scale crawlers, e.g. Nutch or StormCrawler.
- This section includes a few crawlers that are in development and show some promise.
- Crawler - Stars: 233 - Updated: 3/2023 - Checked: 4/2023.
- Included because it's written in PHP, which is uncommon for web crawlers.
- SeimiCrawler - Stars: 1.9k - Updated: 4/2023 - Checked: 4/2023.
- A Java-based, distributed, open source web crawler.
- XXL-CRAWLER - Stars: 654 - Updated: 10/2022 - Checked: 4/2023.
- A Java-based, distributed, open source web crawler.
- Sparkler-Crawler - Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
- A Java/Scala-based web crawler built on Spark.
- crawler - Stars: 22 - Updated: 4/2023 - Checked: 4/2023.
- A Rust, open source web crawler that claims it is "capable of handling millions of pages per second efficiently."
- colly - Stars: 19.3k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, open source framework for building crawlers/scrapers/spiders.
- Montferret - Stars: 5.3k - Updated: 4/2023 - Checked: 4/2023.
- A Go language, open source web scraper. Included despite being a scraper because of its interesting declarative approach.