wiki_data_dump

A tool for traversing and downloading from Wikimedia Data Dumps and their mirrors.

Purpose

To make the maintenance of large wiki datasets easier and more stable.

It also aims to lighten the load on Wikimedia and its mirrors by requesting only the site's index and doing the inevitable searching and navigation of its contents offline.

A web crawler might make multiple requests to find a single file, on top of the notorious fragility of crawling. wiki_data_dump instead caches the site's index. This speeds up repeated use of the library and protects against accidentally flooding Wikimedia with requests, because site navigation never relies on network requests.
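
The caching idea can be sketched in a few lines. This is not the library's implementation, only a minimal illustration under stated assumptions: the `load_index` helper, the `cached_on` field, and the injected `fetch` callable are all hypothetical names, and the cache is treated as fresh only on the day it was written.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Optional


def load_index(
    cache_path: Path,
    fetch: Callable[[], dict],
    today: Optional[date] = None,
) -> dict:
    """Return the cached index if it was written today, else fetch and re-cache it.

    `fetch` is a hypothetical callable that performs the single network
    request for the site index and returns it as a dict.
    """
    today = today or date.today()
    if cache_path.exists():
        cached = json.loads(cache_path.read_text(encoding="utf-8"))
        if cached.get("cached_on") == today.isoformat():
            # Cache is from today: no network request is made.
            return cached["index"]
    # Cache is missing or from another day: make the one request and store it.
    index = fetch()
    payload = {"cached_on": today.isoformat(), "index": index}
    cache_path.write_text(json.dumps(payload), encoding="utf-8")
    return index
```

Once the index is on disk, every lookup of a wiki, job, or file is a plain dictionary access, so repeated runs on the same day cost no requests at all.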

Installation

pip install wiki_data_dump

Usage

You can list all available job names for a given wiki with this short script:

from wiki_data_dump import WikiDump, Wiki

wiki = WikiDump()
en_wiki: Wiki = wiki.get_wiki('enwiki')

print(en_wiki.jobs.keys())

Or you can list the files available from the categorytables SQL job:

from wiki_data_dump import WikiDump, Job

wiki = WikiDump()
categories: Job = wiki.get_job("enwiki", "categorytables")

print(categories.files.keys())

A less trivial example is querying for specific file types when a job contains more files than you need.

For example, it's not uncommon for a job to split its dump into several parts, which makes it necessary to know the file paths of all the parts. If you hard-code the file names, finding the relevant files becomes increasingly difficult as the number of parts grows.

This is a solution that wiki_data_dump provides:

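import re
from typing import List

from wiki_data_dump import WikiDump, File

wiki = WikiDump()

# Look up the job by wiki name and job name.
xml_stubs_dump_job = wiki["enwiki", "xmlstubsdump"]

# Keep only the numbered stub-meta-history parts.
stub_history_files: List[File] = xml_stubs_dump_job.get_files(
    re.compile(r"stub-meta-history[0-9]+\.xml\.gz$")
)

# download() returns the thread it runs in; join() waits for that file
# to finish before the next download starts.
for file in stub_history_files:
    wiki.download(file).join()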

Download processes are threaded by default, and the call to WikiDump.download returns a reference to the thread it's running in.
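
Because each call returns a thread, several downloads can also be started before any of them is joined. The sketch below shows one way to do that in small batches. `download_in_batches` and its `batch_size` parameter are hypothetical names, not part of wiki_data_dump, and the default of 2 is only a cautious guess, on the assumption that a server may limit simultaneous connections.

```python
import threading
from typing import Callable, Iterable, List, TypeVar

FileT = TypeVar("FileT")


def download_in_batches(
    files: Iterable[FileT],
    start_download: Callable[[FileT], threading.Thread],
    batch_size: int = 2,
) -> None:
    """Start up to `batch_size` downloads at once, then wait for the batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending: List[FileT] = list(files)
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Each call is assumed to start a download and return its thread.
        threads = [start_download(item) for item in batch]
        # Wait for the whole batch before starting the next one.
        for thread in threads:
            thread.join()
```

With the script above, this would be called as `download_in_batches(stub_history_files, wiki.download)`, assuming `wiki.download` accepts a single file argument as it does there.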

The process is simple and readable:

1. Get the job that contains the files you want.
2. Filter the files down to those you need.
3. Download the files concurrently (or in parallel).
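
For the filtering step, it can help to try a pattern against a few sample names before downloading anything. The sketch below uses only the standard `re` module. The file names are made up for illustration, `select` is a hypothetical helper, and `search` is used on the assumption that the pattern may match anywhere in the name, which is what its trailing `$` anchor suggests.

```python
import re
from typing import List, Pattern

STUB_HISTORY = re.compile(r"stub-meta-history[0-9]+\.xml\.gz$")

# Hypothetical file names, shaped like typical dump file names.
sample_names = [
    "enwiki-20240101-stub-meta-history1.xml.gz",
    "enwiki-20240101-stub-meta-history27.xml.gz",
    "enwiki-20240101-stub-meta-history.xml.gz",   # no part number
    "enwiki-20240101-stub-meta-current1.xml.gz",  # different dump type
    "enwiki-20240101-stub-articles1.xml.gz",
]


def select(names: List[str], pattern: Pattern[str]) -> List[str]:
    """Return the names in which the pattern is found, in their original order."""
    return [name for name in names if pattern.search(name)]


matching = select(sample_names, STUB_HISTORY)
for name in matching:
    print(name)
```

Only the two numbered stub-meta-history names are selected. The unnumbered file is skipped because `[0-9]+` requires at least one digit.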

For more guidance on how to use this library, see tests.py or the scripts in the examples directory.

Next steps

Automatic detection of which mirror has the fastest download speed at any given time.
Caching that updates only when a resource is out of date, instead of just when the current date has passed the cache's creation date.
The ability to access Wikimedia downloads available in /other/.

Support

I'm a hobbyist, so if this project has helped you, any support would be appreciated, though it is most certainly not required. :)

Disclaimer

The author of this software is not affiliated or associated with, authorized or endorsed by, or in any way officially connected with Wikimedia or any of its affiliates. This software is independently owned and created.
