Open Source Web Crawlers

Table of Contents

  • Comments
  • General Resources
  • Apache Nutch
  • StormCrawler
  • Scrapy
  • Norconex Web Crawler
  • PulsarR
  • Heritrix
  • Sparkler
  • CoCrawler
  • Comparisons
  • Other
  • Maybe...?

Comments

  • This page focuses on web crawlers/spiders as opposed to web scrapers. While there can be significant overlap between the two, our goal is to evaluate systems that are meant for web-scale crawling.
  • This document focuses on general-purpose web crawlers. There is a growing niche of crawlers created specifically for security purposes; these are not covered here.
  • We focus primarily on projects which are being actively developed. Projects which are showing limited signs of life may not be included. If you feel we've passed over a project that should be included, please create an issue or pull request.

General Resources

Apache Nutch

StormCrawler

Scrapy

Norconex Web Crawler

PulsarR

Heritrix

Sparkler

CoCrawler

  • GitHub Repo
    • Stars: 166 - Updated: 4/2022 - Checked: 4/2023.
  • Written in Python by Greg Lindahl (Blekko); currently pre-release.
    • Included primarily because Lindahl has a proven track record in web crawling.

Comparisons

Other

  • Crawlab - Stars: 9.7k - Updated: 4/2023 - Checked: 4/2023.
    • A distributed web crawler admin platform, written in Go, that works with multiple languages and frameworks, including Scrapy.
    • NOTE: Does not appear to have integrations with most web-scale crawlers, e.g. Nutch or StormCrawler.

Maybe...?

  • This section includes a few crawlers that are in development and show some promise.

  • Crawler - Stars: 233 - Updated: 3/2023 - Checked: 4/2023.

    • Including this one because it's written in PHP, which isn't particularly common for web crawlers.
  • SeimiCrawler - Stars: 1.9k - Updated: 4/2023 - Checked: 4/2023.

    • A Java-based, distributed, open source web crawler.
  • XXL-CRAWLER - Stars: 654 - Updated: 10/2022 - Checked: 4/2023.

    • A Java-based, distributed, open source web crawler.
  • Sparkler-Crawler - Stars: 400 - Updated: 4/2023 - Checked: 4/2023.

    • A Java/Scala based web crawler built on Spark.
  • crawler - Stars: 22 - Updated: 4/2023 - Checked: 4/2023.

    • A Rust, open source web crawler that claims it is "capable of handling millions of pages per second efficiently."
  • colly - Stars: 19.3k - Updated: 4/2023 - Checked: 4/2023.

    • A Go language, open source framework for building crawlers/scrapers/spiders.
  • Montferret - Stars: 5.3k - Updated: 4/2023 - Checked: 4/2023.

    • A Go language, open source web scraper. Included despite being a scraper because of its interesting declarative approach.