make_wikipedia.py: long running time #121
Comments
Mmmh, I have not tried to process DE Wikipedia in a while, but when I did it last year I was not having the same issue. I've heard good things about the MediaWiki Parser from Hell (mwparserfromhell), which is luckily still in active development.
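For reference, stripping an article's markup with mwparserfromhell looks roughly like this (a minimal sketch on a hard-coded snippet; iterating over pages in the actual dump is left out):

```python
# Minimal sketch: strip wiki markup from one article's wikitext with
# mwparserfromhell. The sample string is made up for illustration.
import mwparserfromhell

wikitext = "'''Berlin''' ist die [[Hauptstadt]] Deutschlands.<ref>Quelle</ref>"
wikicode = mwparserfromhell.parse(wikitext)
plain = wikicode.strip_code()  # drops bold markers, link brackets, <ref> tags
print(plain)  # roughly: Berlin ist die Hauptstadt Deutschlands.
```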
Thank you for the tip! In the meantime I have discovered a pre-parsed dataset on the Hugging Face hub: wikimedia/wikipedia. They also seem to use this parser, so I will try it for now.
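For anyone else who lands here, loading that dataset looks roughly like this (a sketch; the config name "20231101.de", i.e. snapshot date plus language code, is an assumption, so check the configs listed on the dataset page):

```python
# Sketch: load the pre-parsed German Wikipedia from the Hugging Face hub.
# The config "20231101.de" (dump date + language) is assumed; the fields
# below follow the dataset's usual id/url/title/text schema.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.de", split="train")
print(wiki[0]["title"])       # articles come pre-parsed as plain text
print(wiki[0]["text"][:300])
```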
It took a very long time to process the dataset; my computer has 16 virtual processors, as follows:
Hi, thank you for sharing this outstanding repository!
I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump. Unfortunately, it has been running for several days and, if I interpret the output correctly, it seems to have made only little progress:
At this speed, it would take weeks to complete. Using htop I can see that all processes are busy, so I don't think that this is a multiprocessing problem (#58); however, I am also running it on a Linux machine.
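For concreteness, the same check can be scripted instead of eyeballing htop, e.g. with psutil (an assumed extra dependency here, not something the repository uses; any process monitor works):

```python
# Sketch: confirm the make_wikipedia.py worker processes are actually on-CPU.
# psutil is an assumed dependency (pip install psutil).
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = proc.info["cmdline"] or []
    if any("make_wikipedia" in part for part in cmdline):
        # cpu_percent needs a short sampling interval; 0.0 means the
        # process was idle (blocked) during that window.
        busy = proc.cpu_percent(interval=0.5)
        print(proc.info["pid"], f"{busy:.0f}% CPU")
```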
This is likely a problem of the underlying wikiextractor library, but since there seems to be little to no activity there, I am interested in your experience with this script. Is it normal for this to take so long?