
Zimit refuses to Zim AnandTech. #438

Open
Hacker-Anirudh opened this issue Dec 1, 2024 · 4 comments

@Hacker-Anirudh

Seeing as AnandTech has shut down (and I don't really trust Future to be true to their word and keep it up forever), I decided I wanted to Zimit it, since it is a treasure trove of information. That did not work: I tried twice and got the same error both times:

sizeLimit is 4294967296 and timeLimit is 7200, if that helps. I have attached a screenshot of my second attempt:

[Screenshot of the error from the second attempt]

@benoit74
Collaborator

benoit74 commented Dec 2, 2024

The problem is that the tool looks only at the first 1024 bytes of each HTML document to find its encoding. The typical recommendation is to put the <meta http-equiv="Content-Type" content="text/html; charset=xxxxx"> tag at the very beginning of the document. This is not the case on https://www.anandtech.com/, where we would need a few more bytes.
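Roughly, the behaviour looks like this (a simplified sketch for illustration, not the actual warc2zim code; the regex, the example page, and the byte counts are made up):

```python
# Illustration: a charset sniffer that inspects only the first `limit` bytes
# misses a <meta ... charset=...> declaration that appears further down.
import re

META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def sniff_charset(html: bytes, limit: int = 1024) -> str | None:
    """Return the declared charset only if it sits within the first `limit` bytes."""
    match = META_CHARSET.search(html[:limit])
    return match.group(1).decode("ascii") if match else None

# A page whose <head> starts with a long run of scripts/styles pushes the
# charset declaration past the 1024-byte window:
page = (
    b"<html><head>"
    + b"<script>/* ... */</script>" * 60
    + b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    + b"</head><body>...</body></html>"
)

print(sniff_charset(page))               # None  -> decoding has to guess
print(sniff_charset(page, limit=4096))   # UTF-8 -> found with a larger window
```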

We have an advanced switch to control this: https://github.com/openzim/warc2zim/blob/main/src/warc2zim/main.py#L129-L134

Unfortunately, this switch is not exposed on zimit.kiwix.org because it is deemed way too complex for normal users and quite risky for our workers should someone abuse it (reading too much data would make the process consume lots of memory / be too slow).

Note that in 2 hours only 1434 pages have been crawled out of the 20420 already found, so the ZIM is going to be far from complete anyway. Would you be interested in purchasing more time to complete the ZIM?

@Hacker-Anirudh
Author

@benoit74 I would not mind purchasing some more time given the price is right, but I am also interested in running it locally. Would a quad-core Haswell chip, 16 gigs of RAM, and 256 gigs of free storage (1 TB SSD) suffice? How long would that take anyway?

@benoit74
Collaborator

benoit74 commented Dec 2, 2024

I would not mind purchasing some more time given the price is right

@Popolechien do we have "official" prices for that?

Would a quad core Haswell chip, 16 gigs of RAM, and 256 gigs of free storage (1TB SSD) suffice?

CPU and RAM are more than enough. No certainty regarding the total disk size since I have no idea how big the website is: you have to account for about twice the final ZIM size, because we store a temporary file before creating the ZIM which is about the same size as the ZIM itself. But from what I saw, it seems definitely feasible.
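As a toy illustration of that 2x rule (the 50 GiB figure below is a placeholder, not an estimate of AnandTech's actual size):

```python
# Peak disk usage during the scrape: the final ZIM plus a temporary file
# of roughly the same size, hence ~2x the expected ZIM size.
expected_zim_gib = 50                 # hypothetical final ZIM size
peak_disk_gib = 2 * expected_zim_gib  # ZIM + temporary file
print(f"Plan for at least ~{peak_disk_gib} GiB of free disk space")  # ~100 GiB
```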

How long would that take anyway?

The rule of thumb is ~5 secs per page. Here the scraper has already found 20k pages. Maybe it will find more while exploring, but let's start with this number: it gives us 100k secs, or ~28 hours. You can increase the parallelism with the --workers setting. By default it is 1; increasing it to 4 is the recommended maximum (even if you can go further, this is like the number of tabs running simultaneously in the browser, and too many tabs means more memory/CPU and an increased likelihood of being banned / blocked). With this you can bring it down to about 7 hours.
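Spelled out, the estimate is just the following (a sketch using the rough figures above; actual crawl speed will vary):

```python
# Back-of-the-envelope crawl time: pages * seconds-per-page / workers.
def crawl_hours(pages: int, secs_per_page: float = 5.0, workers: int = 1) -> float:
    return pages * secs_per_page / workers / 3600

print(round(crawl_hours(20_000), 1))             # 27.8 -> ~28 h with the default --workers 1
print(round(crawl_hours(20_000, workers=4), 1))  # 6.9  -> ~7 h with --workers 4
```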

@Popolechien
Contributor

do we have "official" prices for that?

No, because we don't know in advance what the final size of the ZIM file will be. We could guesstimate hardware/storage costs by averaging the size of our regular zimit files and adding some labour costs. This however would only be valid for ZIM files that can be produced off the shelf, meaning without any additional tinkering with the parameters or code fixes on our part.

Back-of-the-envelope calculations say we should be anywhere between $30 and $100, and ideally this should be done via a portal with minimal operational involvement on our part. But seeing how many projects we currently have ongoing, I would be loath to commit to yet another one for the time being.
