Zimit refuses to Zim AnandTech. #438
The problem is that the tool looks only at the first 1024 bytes of each HTML document to find its encoding. The typical recommendation is to put the encoding declaration within those first bytes.

We have an advanced switch to control this: https://github.com/openzim/warc2zim/blob/main/src/warc2zim/main.py#L129-L134

Unfortunately, this switch is not exposed in zimit.kiwix.org because it is deemed way too complex for a normal user and very risky for our worker should someone abuse it (reading too much data is going to make this process consume lots of memory / be too slow).

Note that in 2 hours only 1434 pages have been crawled out of the 20420 already found, so the ZIM is going to be far from complete anyway. Would you be interested in purchasing more time to complete the ZIM?
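To illustrate the 1024-byte limit described above, here is a minimal sketch of this kind of encoding sniffing. It is not warc2zim's actual code: `sniff_charset`, `CHARSET_RE`, and the `max_bytes` parameter are hypothetical names chosen for the example, and the regex is a simplification of real charset detection.

```python
import re
from typing import Optional

# Hypothetical helper, NOT warc2zim's implementation: look for a charset
# declaration in a <meta> tag within the first `max_bytes` bytes only.
CHARSET_RE = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(html: bytes, max_bytes: int = 1024) -> Optional[str]:
    """Return the declared charset if it appears in the scanned prefix."""
    match = CHARSET_RE.search(html[:max_bytes])
    return match.group(1).decode("ascii").lower() if match else None


# Two toy pages: same declaration, before or after ~2 kB of other content.
padding = b"<!-- " + b"x" * 2000 + b" -->"
early = b'<html><head><meta charset="utf-8">' + padding + b"</head></html>"
late = b"<html><head>" + padding + b'<meta charset="utf-8"></head></html>'

print(sniff_charset(early))       # utf-8
print(sniff_charset(late))        # None: declaration is past byte 1024
print(sniff_charset(late, 4096))  # utf-8 once more bytes are scanned
```

Scanning a larger prefix finds the late declaration, at the cost of reading more data per document, which is the trade-off mentioned above.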
@benoit74 I would not mind purchasing some more time provided the price is right, but I am also interested in running it locally. Would a quad-core Haswell chip, 16 gigs of RAM, and 256 gigs of free storage (1TB SSD) suffice? How long would that take anyway?
@Popolechien do we have "official" prices for that?
CPU and RAM are more than enough. No certainty regarding the total disk size since I have no idea how big the website is (you have to account for about twice the final ZIM size, because we have to store a temporary file of about the same size as the ZIM before creating it). But from what I saw, it definitely seems feasible.
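As a quick sanity check of the "twice the final ZIM size" rule against the 256 gigs mentioned earlier, a sketch like the following gives the largest ZIM that would fit. The function name and the `overhead_factor` parameter are made up for this example, and the factor of 2 is only the approximation quoted above.

```python
def max_zim_size_gb(free_disk_gb: float, overhead_factor: float = 2.0) -> float:
    """Largest final ZIM (in GB) that fits, assuming the temporary file
    is about as large as the ZIM itself (hence the factor of 2)."""
    return free_disk_gb / overhead_factor


print(max_zim_size_gb(256))  # 128.0
```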
Rule of thumb is ~5 secs per page. Here the scraper has already found 20k pages. Maybe it will find more while exploring, but let's start with this number: it gives us 100k secs, or ~28 hours. You can increase the parallelism with
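The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: `estimate_crawl_hours` and its `workers` parameter are hypothetical names, and dividing by the number of workers assumes perfect scaling, which a real crawl will not achieve.

```python
def estimate_crawl_hours(pages: int, secs_per_page: float = 5.0, workers: int = 1) -> float:
    """Rough crawl duration in hours, assuming work splits evenly across workers."""
    return pages * secs_per_page / workers / 3600


# Rule of thumb applied to the 20k pages found so far.
print(round(estimate_crawl_hours(20_000), 1))             # 27.8
# Same crawl with 4 parallel workers (idealized).
print(round(estimate_crawl_hours(20_000, workers=4), 1))  # 6.9

# Rate observed in the hosted run: 1434 pages crawled in 2 hours.
print(round(2 * 3600 / 1434, 2))                          # 5.02 secs per page
```

The rate observed in the hosted run (1434 pages in 2 hours) comes out very close to the 5 secs per page rule of thumb.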
No, because we don't know in advance what the final size of the ZIM file will be. We could guesstimate hardware/storage costs by averaging the size of our regular zimit files, and add some labour costs. This however would only be valid for ZIM files that can be produced off-the-shelf, meaning without any additional tinkering of the parameters or code fix on our part. Back-of-the-envelope calculations say we should be anywhere between $30 and $100, and ideally this should be done via a portal with minimal operational involvement on our part. But seeing how many projects we currently have ongoing, I would be loath to commit to yet another one for the time being.
Seeing as AnandTech has shut down (and I don't really trust Future to be true to their word and keep it up forever), I decided I wanted to Zimit it, since it is a treasure trove of information. That did not work. I tried twice and got the same error both times:
sizeLimit is 4294967296 (4 GiB) and timeLimit is 7200 (2 hours) if that helps. I have attached a screenshot of my second attempt: