
Zimit refuses to Zim AnandTech. #438

Open
Hacker-Anirudh opened this issue Dec 1, 2024 · 4 comments

@Hacker-Anirudh

Seeing as AnandTech has shut down (and I don't really trust Future to be true to their word and keep it up forever), I decided I wanted to Zimit it, since it is a treasure trove of information. That did not work: I tried twice and got the same error both times:

sizeLimit is 4294967296 and timeLimit is 7200, if that helps. I have attached a screenshot of my second attempt:

[Screenshot of the error from the second attempt]

@benoit74
Collaborator

benoit74 commented Dec 2, 2024

The problem is that the tool looks only at the first 1024 bytes of each HTML document to find its encoding. The typical recommendation is to put the <meta http-equiv="Content-Type" content="text/html; charset=xxxxx"> tag at the very beginning of the document. This is not the case on https://www.anandtech.com/, where we would need a few more bytes.
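Roughly, the behaviour looks like this (a simplified sketch for illustration, not the actual warc2zim code; the regex, the example page, and the byte counts are made up):

```python
# Illustration: a charset sniffer that inspects only the first `limit` bytes
# misses a <meta ... charset=...> declaration that appears further down.
import re

META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def sniff_charset(html: bytes, limit: int = 1024) -> str | None:
    """Return the declared charset only if it sits within the first `limit` bytes."""
    match = META_CHARSET.search(html[:limit])
    return match.group(1).decode("ascii") if match else None

# A page whose <head> starts with a long run of scripts/styles pushes the
# charset declaration past the 1024-byte window:
page = (
    b"<html><head>"
    + b"<script>/* ... */</script>" * 60
    + b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    + b"</head><body>...</body></html>"
)

print(sniff_charset(page))               # None  -> decoding has to guess
print(sniff_charset(page, limit=4096))   # UTF-8 -> found with a larger window
```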

We have an advanced switch to control this: https://github.com/openzim/warc2zim/blob/main/src/warc2zim/main.py#L129-L134

Unfortunately, this switch is not exposed on zimit.kiwix.org because it is deemed way too complex for normal users and quite risky for our workers should someone abuse it (reading too much data would make the process consume lots of memory / be too slow).

Note that in 2 hours only 1434 pages have been crawled out of the 20420 already found, so the ZIM is going to be far from complete anyway. Would you be interested in purchasing more time to complete the ZIM?

@Hacker-Anirudh
Author

@benoit74 I would not mind purchasing some more time given the price is right, but I am also interested in running it locally. Would a quad-core Haswell chip, 16 gigs of RAM, and 256 gigs of free storage (1 TB SSD) suffice? How long would that take anyway?

@benoit74
Collaborator

benoit74 commented Dec 2, 2024

I would not mind purchasing some more time given the price is right

@Popolechien do we have "official" prices for that?

Would a quad core Haswell chip, 16 gigs of RAM, and 256 gigs of free storage (1TB SSD) suffice?

CPU and RAM are more than enough. No certainty regarding the total disk size since I have no idea how big the website is: you have to account for about twice the final ZIM size, because we store a temporary file before creating the ZIM which is about the same size as the ZIM itself. But from what I saw, it seems definitely feasible.
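As a toy illustration of that 2x rule (the 50 GiB figure below is a placeholder, not an estimate of AnandTech's actual size):

```python
# Peak disk usage during the scrape: the final ZIM plus a temporary file
# of roughly the same size, hence ~2x the expected ZIM size.
expected_zim_gib = 50                 # hypothetical final ZIM size
peak_disk_gib = 2 * expected_zim_gib  # ZIM + temporary file
print(f"Plan for at least ~{peak_disk_gib} GiB of free disk space")  # ~100 GiB
```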

How long would that take anyway?

The rule of thumb is ~5 secs per page. Here the scraper has already found 20k pages. Maybe it will find more while exploring, but let's start with this number: it gives us 100k secs, or ~28 hours. You can increase the parallelism with the --workers setting. By default it is 1; increasing it to 4 is the recommended maximum (even if you can go further, this is like the number of tabs running simultaneously in the browser, and too many tabs means more memory/CPU and an increased likelihood of being banned / blocked). With this you can bring it down to about 7 hours.
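Spelled out, the estimate is just the following (a sketch using the rough figures above; actual crawl speed will vary):

```python
# Back-of-the-envelope crawl time: pages * seconds-per-page / workers.
def crawl_hours(pages: int, secs_per_page: float = 5.0, workers: int = 1) -> float:
    return pages * secs_per_page / workers / 3600

print(round(crawl_hours(20_000), 1))             # 27.8 -> ~28 h with the default --workers 1
print(round(crawl_hours(20_000, workers=4), 1))  # 6.9  -> ~7 h with --workers 4
```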

@Popolechien
Contributor

do we have "official" prices for that?

No, because we don't know in advance what the final size of the ZIM file will be. We could guesstimate hardware/storage costs by averaging the size of our regular zimit files and adding some labour costs. This however would only be valid for ZIM files that can be produced off the shelf, meaning without any additional tinkering with the parameters or code fixes on our part.

Back-of-the-envelope calculations say we should be anywhere between $30 and $100, and ideally this should be done via a portal with minimal operational involvement on our part. But seeing how many projects we currently have ongoing, I would be loath to commit to yet another one for the time being.
