Multithreading is significantly broken #106
The block of code mentioned above in fact has two issues:
I tried to replace it with something else:
It seems better, but I'm pretty sure we still have issues with the

Multithreading is always a very complex thing to implement. I've isolated #100, but I'm pretty sure we have other locks missing in our code: even if a list/set/dictionary is thread-safe, that does not mean there are no portions of our code that need locking, e.g. when checking that an item is not already in a list and then adding it if it was missing.

It might be a bad idea, but at this point I suggest that we get rid of the multithreading in this (all?) scrapers for now. We do not have the resources to set it up properly or maintain it in the long run. At the very least it is not OK now, and I have always failed at such endeavors, even when I initially believed they would be easy. It makes logs very difficult to read, because we never know which thread/process is "speaking", and it does not help to have linear, predictable performance. Zimit and iFixit do not use any multithreading. They are known to be quite "slow", but at least their behavior is more predictable and the code is easier to maintain. WDYT?
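The "check then add" hazard described above is a classic check-then-act race. A minimal sketch (with hypothetical names, not code from this scraper): the membership test and the append are each fine on their own in CPython, but two threads can interleave between them, so the pair must be one critical section.

```python
import threading

seen = []
seen_lock = threading.Lock()

def add_if_missing_unsafe(item):
    if item not in seen:      # threads A and B can both pass this check...
        seen.append(item)     # ...and both append, producing a duplicate

def add_if_missing_safe(item):
    with seen_lock:           # the whole check-then-act is one critical section
        if item not in seen:
            seen.append(item)
```

This is why per-container thread safety is not enough: the lock has to cover the compound operation, not the individual calls.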
Concurrency is complicated and makes maintenance and debugging difficult. That's a fact. We deemed it necessary, knowingly, for performance reasons, and we use some in almost all scrapers. Discarding it because it makes an issue harder to understand/fix seems unreasonable.

This scraper (and others) ran for a long time. If I look at the Zimfarm runs, I see libretext kept working, and khan-fr hadn't been run since 11m ago, only to be run and fail 3m ago (video-encoding-related). Between those dates, a lot of changes happened to the scraper: bootstrap, scraperlib, video encoding, video concurrency…

If that helps, get rid of concurrency locally to ensure the rest works as expected, then bring it back properly. But to remove such a feature from one scraper (or all of them), we'd need a lot more arguments.
That doesn't prove it was working properly. libretext has no video encoding, so sure, it works properly. The log of khan-fr already shows a multithreading issue: the whole log is gone, but what's left in the scraper stdout at https://farm.openzim.org/pipeline/4b5053ff-0ed4-4433-8217-119a5d8ae7d7/debug shows that multiple exceptions occurred in video processing which did not stop the scraper.
You made a point 🤣
I don't feel like it is necessary, but thank you! We just all need to be aware that I will spend time fixing some things now, and other bugs will probably arise in the future. But I'm committed to doing my best to avoid/fix as many as possible! (The PR for the bugs I'm aware of is almost ready, indeed.)
Understandable; fortunately, we only have a few Kolibri recipes for now, so it will be easy to spot regressions.
I'm sorry to finally say that there is no PR ready yet; I have not managed to make multiprocessing work as expected:
I've reproduced the problem in an isolated test case
For now I consider that

At this stage, I recommend investing some development days to migrate to

I think that for the coming weeks/months we can live with this "bug". It seems to mainly mean that the scraper will not stop on the first task failure... but it has probably always been so.
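The "does not stop on first task failure" behaviour is easy to reproduce with the standard pool API. A minimal sketch (hypothetical `process_node` task, not code from this scraper): an exception raised in a pool worker is only re-raised in the parent when `.get()` is called on that task's `AsyncResult`, so by then every other submitted task has already run. A `ThreadPool` is used here for portability; `AsyncResult.get()` behaves the same with a process-based `multiprocessing.Pool`.

```python
from multiprocessing.pool import ThreadPool

def process_node(node):
    if node == "bad":
        raise RuntimeError(f"failed on {node}")
    return node.upper()

def run_all(nodes):
    with ThreadPool(2) as pool:
        async_results = [pool.apply_async(process_node, (n,)) for n in nodes]
        results, errors = [], []
        for res in async_results:
            try:
                results.append(res.get())   # the worker's exception only surfaces here
            except Exception as exc:
                errors.append(exc)          # meanwhile, the other tasks ran to completion
        return results, errors
```

Nothing in the pool itself aborts the remaining work when one task fails; fail-fast behaviour has to be built on top.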
What we need has in fact already been implemented in https://github.com/openzim/sotoki/commits/main/src/sotoki/utils/executor.py, and has since been reused in wikihow and ifixit (at least). This executor model should probably be shared at the python-scraperlib level.
We are supposed to wait for all video + node processing, and return on the first exception.
The log of the last khan_academy_fr run, as reported in #100, shows that multiple nodes had to fail before the scraper stopped.
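The intended contract (wait for everything, but surface the first failure) can be sketched with the standard library alone. This is not the sotoki executor linked above, just an illustration of the same semantics using `concurrent.futures.wait` with `FIRST_EXCEPTION`:

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait

def run_until_first_failure(tasks):
    """Run callables concurrently; stop scheduling and re-raise
    as soon as any one of them fails."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(task) for task in tasks]
        # Returns as soon as any future raises (or all complete).
        done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
        for future in not_done:
            future.cancel()      # drop tasks that have not started yet
        for future in done:
            future.result()      # re-raises the failure, if there was one
```

Already-running tasks still finish (the executor's context manager joins them), but nothing new is scheduled and the caller sees the exception instead of silently continuing.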
I tried to debug/fix the issue locally, but it just makes no sense for now; there is probably a bug somewhere, but I cannot figure out where.