# Ani-Spider

Ani-Spider is a web scraping tool built with Scrapy, a powerful and flexible web crawling and scraping framework for Python. The project is designed to scrape data from the MyAnimeList website efficiently.
## Table of Contents

- Features
- Requirements
- Installation
- Usage
- Project Structure
- Contributing
- Dataset
- License
- Acknowledgments
## Features

- Efficient Web Scraping: Leverages Scrapy's asynchronous request handling to scrape data quickly.
- Customizable: Scraping behaviour is easy to tune through the project settings (a sample is sketched after this list).
- Extensible: Can be extended with custom spiders for specific websites.
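As a rough illustration of the kind of tuning Scrapy's settings allow, the snippet below shows options a project like this might enable in `settings.py`. The values are assumptions for illustration, not the repository's actual configuration.

```python
# settings.py -- hypothetical excerpt; the values are illustrative, not the project's actual config.

BOT_NAME = "ani_spider"

# Be polite to MyAnimeList: respect robots.txt and throttle requests.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the crawl rate to how the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```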
## Requirements

- Python 3.x
- Scrapy
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Mridul-23/Ani-Spider.git
   cd Ani-Spider
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows use `env\Scripts\activate`
   ```

3. Install the required dependencies:

   ```bash
   pip install scrapy
   ```
## Usage

1. Navigate to the project directory:

   ```bash
   cd ani_spider
   ```

2. Run the spider:

   ```bash
   scrapy crawl <spider_name>
   ```

   Replace `<spider_name>` with the name of the spider you wish to run.
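For example, assuming the spiders register the names `web_crawler` and `img_crawler` (spider names can differ from file names, so check each spider's `name` attribute), the crawl output can be written to a file with Scrapy's feed exports:

```bash
# Run the anime-data spider and write the scraped items to a JSON feed
# (-O overwrites the file; it requires Scrapy 2.0 or newer)
scrapy crawl web_crawler -O anime_data.json

# Run the image-link spider and append its items to a CSV feed
scrapy crawl img_crawler -o anime_images.csv
```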
Available spiders:

- `web_crawler.py`: Crawls anime data such as name, stats, and other details from the MyAnimeList website.
- `img_crawler.py`: Crawls anime image links along with names from the MyAnimeList website.
- Uncomment the relevant item fields in `items.py` according to the data being crawled. This ensures that the scraped data is properly structured and stored (a sketch of what this might look like follows below).
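As a rough sketch of that idea, the `Item` below is hypothetical (the class and field names are assumptions, not the repository's actual definitions): every field stays declared in `items.py`, and only the ones the active spider populates are left uncommented.

```python
# items.py -- hypothetical sketch; the class and field names are illustrative only.
import scrapy


class AnimeItem(scrapy.Item):
    # Fields used when crawling anime data (web_crawler.py)
    name = scrapy.Field()
    score = scrapy.Field()
    # rank = scrapy.Field()        # uncomment when crawling ranking stats
    # popularity = scrapy.Field()  # uncomment when crawling ranking stats

    # Fields used when crawling image links (img_crawler.py)
    # image_url = scrapy.Field()   # uncomment when running the image spider
```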
## Project Structure

- `ani_spider/`: Contains the main Scrapy project.
  - `spiders/`: Directory that stores the spiders (a minimal example is sketched after this list).
    - `web_crawler.py`: Spider for crawling anime data.
    - `img_crawler.py`: Spider for crawling anime images.
  - `items.py`: Defines the data structures for the scraped items.
  - `pipelines.py`: Contains logic for data cleaning.
  - `middlewares.py`: Contains custom middleware for convenient scraping.
  - `settings.py`: Configuration settings for the Scrapy project.
- `scrapy.cfg`: The project configuration file.
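To show how a spider in `spiders/` fits together, here is a minimal sketch in the spirit of `web_crawler.py`. The start URL and CSS selectors are assumptions made for illustration; the real spider's selectors and yielded fields will differ.

```python
# spiders/web_crawler.py -- minimal sketch; the URL and selectors are assumed, not the real ones.
import scrapy


class AnimeSpider(scrapy.Spider):
    name = "web_crawler"
    start_urls = ["https://myanimelist.net/topanime.php"]  # assumed entry point

    def parse(self, response):
        # Each ranked anime sits in its own table row (selector is illustrative).
        for row in response.css("tr.ranking-list"):
            yield {
                "name": row.css("h3 a::text").get(),
                "score": row.css("span.score-label::text").get(),
            }

        # Follow pagination if a "next" link exists on the page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```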
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any features, improvements, or bug fixes.
## Dataset

If you are interested in the datasets that are fetched and preprocessed by this spider, see the Dataset.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments

- Scrapy - An open-source and collaborative web crawling framework for Python.