Performance reading/writing large zip files #387
Replies: 8 comments
-
Just been trying a bit of experimentation myself. I tried making a basic ArrayBufferReader:

```js
class ArrayBufferReader extends Reader {
  constructor(arrayBuffer) {
    super();
    Object.assign(this, {
      arrayBuffer,
      size: arrayBuffer.byteLength
    });
  }

  readUint8Array(offset, length) {
    const reader = this;
    return new Uint8Array(reader.arrayBuffer, offset, length);
  }
}
```

However, reading a zip file with this fails with an error. If I instead fix it by slicing the ArrayBuffer, which means making lots of tiny copies, and leaving the DataView code as-is:

```js
return new Uint8Array(reader.arrayBuffer.slice(offset, offset + length));
```

This brings down the time to read the zip from 16.2 seconds with BlobReader to 7.9 seconds with ArrayBufferReader - a 50% improvement even though it's still having to copy the ArrayBuffer, and a big improvement for just a few lines of code! This suggests to me there is indeed low-hanging fruit here.

It would be nice to see a built-in ArrayBufferReader along these lines - callers could then decide whether to read zips with BlobReader or ArrayBufferReader depending on the zip size and memory requirements. Ideally it could avoid slicing the ArrayBuffer (it shouldn't be necessary for reading) while still being faster, but I'm not sure what was going on with that benchmark. Perhaps another approach could be to just create one DataView across the entire ArrayBuffer and always refer to that, applying the offsets to its get methods (see the sketch below). I'm not sure how this would all apply to writing, but perhaps similar gains could be achieved without too much work.
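To make the single-DataView idea concrete, here is a rough, hypothetical sketch (BufferView is just an illustrative name, not part of zip.js, and it isn't wired into zip.js internals): one DataView is created over the whole ArrayBuffer, the get methods take absolute offsets, and byte ranges come back as views rather than copies.

```js
// Hypothetical illustration of the "one DataView over the whole buffer" idea.
// Not integrated with zip.js; it only shows the access pattern.
class BufferView {
  constructor(arrayBuffer) {
    this.view = new DataView(arrayBuffer);
  }
  getUint16(offset) {
    // zip structures are little-endian
    return this.view.getUint16(offset, true);
  }
  getUint32(offset) {
    return this.view.getUint32(offset, true);
  }
  getBytes(offset, length) {
    // a view over the same underlying buffer - no copy
    return new Uint8Array(this.view.buffer, offset, length);
  }
}
```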
-
I tried your test and I noticed that in that particular case, the call to […]. Regarding the […].
-
Ah, don't know how I missed that! Yeah, it works great for reading and improves performance a lot.

For writing though, Uint8ArrayWriter is slower than BlobWriter - BlobWriter takes 12.1 seconds, Uint8ArrayWriter takes 14.9 seconds. I think this is because of O(n^2) performance in Uint8ArrayWriter - writes past the end have to copy all the existing content to a new buffer and then add the new content (https://github.com/gildas-lormeau/zip.js/blob/master/lib/core/io.js#L518). It's faster to just queue up all the writes and copy them all to a single buffer in getData():

```js
class Uint8ArrayWriter extends Writer {
  init(initSize = 0) {
    super.init();
    Object.assign(this, {
      offset: 0,
      arrayParts: [],
      totalSize: 0
    });
  }

  writeUint8Array(array) {
    const writer = this;
    writer.arrayParts.push(array);
    writer.totalSize += array.length;
  }

  getData() {
    const writer = this;
    const arrayBuffer = new ArrayBuffer(writer.totalSize);
    const ret = new Uint8Array(arrayBuffer);
    let i = 0;
    for (const array of writer.arrayParts) {
      ret.set(array, i);
      i += array.length;
    }
    return ret;
  }
}
```

For me this takes the zip write time down to about 11 seconds. That's much faster than the previous Uint8ArrayWriter implementation, but still only a little faster than BlobWriter. One possible reason it's still slow is that blob() calls still come up in the profile for writing a zip with Uint8ArrayWriter. This seems to be because getFileEntry() creates a […].

Also, even with Uint8ArrayReader, 7 seconds still seems pretty slow to read 10000 files - that's still 42 seconds to read a zip with 60000 files. Surely it should be able to read 10000 files in ~1 second? A Chrome profile shows it is extremely heavy on posting tasks to the main thread; I suspect it's a bit of a worst-case scenario for task posting. I'm not sure what ultimately is causing that (maybe the use of streams?) or whether a browser issue should be filed for browsers to optimise it.
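For completeness, here is a hedged sketch of how a custom writer like the one above (together with the built-in Uint8ArrayReader) would typically be wired into zip.js. It assumes the @zip.js/zip.js ES module exports (ZipWriter, ZipReader, TextReader, TextWriter, Uint8ArrayReader, Writer) and that close() resolves to the writer's getData() result, so treat it as illustrative rather than definitive.

```js
import {
  ZipWriter, ZipReader, TextReader, TextWriter, Uint8ArrayReader, Writer
} from "@zip.js/zip.js";

// Uint8ArrayWriter below refers to the parts-based custom class from this
// comment (which extends the imported Writer base class), not the built-in
// writer of the same name.
async function roundTrip() {
  const writer = new Uint8ArrayWriter();
  const zipWriter = new ZipWriter(writer);
  await zipWriter.add("hello.txt", new TextReader("hello"));
  const data = await zipWriter.close(); // Uint8Array produced by getData()

  // read it back with the built-in Uint8ArrayReader
  const zipReader = new ZipReader(new Uint8ArrayReader(data));
  const [entry] = await zipReader.getEntries();
  console.log(await entry.getData(new TextWriter())); // "hello"
  await zipReader.close();
}
```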
-
I agree that your implementation is faster but it also consumes twice as much RAM. Note also that when reading entries in a zip file, […]

Actually, I've seen similar issues with […] (lines 188 to 225 in 9b56c15). Maybe zip.js should provide 2 sub-types of […]

Apart from that, I will try to do some profiling tests to see if I can find some points of improvement, but I feel that it will be hard.
-
I forgot to mention that zip.js relies on […]
-
That's a good point about the RAM usage and maximum supported size, I hadn't thought about those. For our case they're not a huge problem, but there will be other cases where those are important concerns. You're right that custom reader/writer classes give us scope to make our own tradeoffs - we can switch to a customised writer to improve performance for our specific case. More built-in classes covering these tradeoffs would be nice, but in a way that's already supported via the custom classes.

I still think the default blob reader/writer classes are strangely slow though. It's hard to make sense of a profile as it seems to mostly be thousands of tiny tasks, so I'm inclined to suspect there is too much task-posting overhead, but the profile doesn't really offer any clues as to why so many tasks are posted (I suspected the stream API as a bit of a wild guess). Perhaps for this aspect we could escalate it to a browser vendor like Google and see what their engineers say? Optimising a widely-used zip library should be a compelling optimisation target for them.
-
Well, the […]

Do you know how much slower zip.js is compared to another program/implementation (in any language/lib) when running this test? I didn't do this kind of comparison test on my end.
-
Hi, and thanks for making a great library!
I've been looking into the performance of zip.js with very large zip files. In our case we can end up dealing with 100 MB+ zip files containing 75000+ files. I think zip.js could be faster at handling this.
I made a small benchmark here (host the files on a webserver and load index.html):
zipjs-benchmark.zip
The benchmark writes a zip with 10000 tiny text files, then reads the zip back, extracting all the files, using default settings. (It reads and writes everything asynchronously in parallel, but throttles to a maximum of 16 concurrent operations, since dumping thousands of simultaneous reads/writes can produce some pretty pathological performance characteristics. I don't think the throttling reduces the benchmark time in any significant way.) A rough sketch of the write loop is included at the end of this post.
Despite running on a high-end desktop, it takes ~1 ms per file. That might sound fast, but it's only about 1000 files per second. So if you have a zip with 60000 files, it will take at least a minute to process it (either reading or writing), even on a high-end system, which seems like a long time to wait given how fast modern systems are.
With both reading and writing, a Chrome profile shows performance bottlenecked on the main thread - using web workers doesn't look like it helps here. In both cases around half the main thread CPU time appears to be in "Run Microtasks" and "slice".
I would guess zip.js uses a great deal of promises and slices for thousands of tiny bits of work. Perhaps there is some low-hanging fruit to optimise this? For example, these days most devices have so much memory that loading a 200 MB zip into memory isn't really a problem, so maybe zip.js could provide a way to read/write zips directly to/from one big ArrayBuffer, creating lightweight Uint8Arrays directly on that ArrayBuffer without making any copies. I tried prototyping that for reading zips but I couldn't get it working without slicing the ArrayBuffer, so I guess there are some internal details I didn't get right. Or maybe it could use chunks of ArrayBuffers as a sort of cache (although that might not work well with random access). Or maybe something else. Anyway, just thought I'd file this in case anyone has some good ideas.
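As mentioned above, here is a rough sketch of what the write half of such a benchmark might look like - not the attached zipjs-benchmark.zip itself, just an illustration assuming the @zip.js/zip.js exports (ZipWriter, BlobWriter, TextReader) and a simple worker-pool throttle of 16 concurrent add() calls:

```js
import { ZipWriter, BlobWriter, TextReader } from "@zip.js/zip.js";

const FILE_COUNT = 10000;
const MAX_CONCURRENT = 16;

async function writeBenchmark() {
  const zipWriter = new ZipWriter(new BlobWriter("application/zip"));
  const start = performance.now();
  let nextIndex = 0;
  // 16 loops run in parallel, each pulling the next file index until done,
  // so at most 16 add() calls are in flight at any time
  await Promise.all(Array.from({ length: MAX_CONCURRENT }, async () => {
    while (nextIndex < FILE_COUNT) {
      const i = nextIndex++;
      await zipWriter.add(`file${i}.txt`, new TextReader(`contents of file ${i}`));
    }
  }));
  const blob = await zipWriter.close();
  console.log(`wrote ${FILE_COUNT} files in ${(performance.now() - start).toFixed(0)} ms (${blob.size} bytes)`);
  return blob;
}
```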