Slow heap allocations in a Wasm runtime #4592
Unfortunately To give a better idea of why it is slow and how this can be fixed, it would help if you could give some information on the specific workload. Such as:
|
Thank you for looking into it.
Forgot to mention that I ran these benchmarks on TinyGo v0.33.0; the CPU is an Apple M3 Pro. As far as I can understand from the changelog, the precise GC was introduced in v0.34.0. Rerunning the benchmark with v0.34.0 shows these results:
With v0.33.0 it was
It seems to be in the same ballpark and still significantly slower than the results I get with the custom GC (https://github.com/wasilibs/nottinygc) for example, or with the I am new to Wasm so it is possible I'm missing some configuration option or making an obvious mistake. Are these numbers what you would expect or do they look wrong?
I'm currently evaluating whether WebAssembly would meet my requirements, and the way I plan to use it might change because what I have in mind might not be the best way to work with Wasm. I would like to have several Wasm guests that can interact with each other through the host. A Wasm guest would be able to call a host-exported function to make an HTTP request, for example, and then call other host-exported functions while processing the response (which can be quite large). Component model support is currently only available in Wasmtime, and only when it is embedded in Rust, so my plan was to implement the data exchange between a host and a guest through the linear memory for now. This means the data would have to be serialized and deserialized, and zero-copy would likely be difficult to implement. Because of that I expect that there will be a number of heap allocations in the process. If several interactions between a host and a guest are made with several megabytes' worth of data, the overhead just for memory allocations could be tens of milliseconds, and if the data reaches a hundred MB it would probably increase to seconds, if these benchmarks are accurate. I expect a guest module to have just a handful of goroutines; the heap size would be up to 2 GB, and it will likely be mostly taken by structs or maps with values containing relatively large |
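To make the linear-memory exchange described above concrete, here is a minimal sketch in Go. It is not code from this thread: the names `packPtrLen`, `unpackPtrLen`, `writeFrame` and `readFrame` are hypothetical, the 4-byte little-endian length prefix is just one possible framing, and a plain `[]byte` stands in for the guest's linear memory so the example runs without a Wasm runtime.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packPtrLen packs a 32-bit offset and a 32-bit length into one uint64,
// so that a single i64 value can describe a buffer in linear memory.
// (Hypothetical convention, assumed for this sketch.)
func packPtrLen(ptr, size uint32) uint64 {
	return uint64(ptr)<<32 | uint64(size)
}

// unpackPtrLen reverses packPtrLen.
func unpackPtrLen(v uint64) (ptr, size uint32) {
	return uint32(v >> 32), uint32(v)
}

// writeFrame copies payload into mem at offset ptr, preceded by a 4-byte
// little-endian length prefix. It returns the total number of bytes written.
func writeFrame(mem []byte, ptr uint32, payload []byte) (uint32, error) {
	start := uint64(ptr)
	total := 4 + uint64(len(payload))
	if start+total > uint64(len(mem)) {
		return 0, fmt.Errorf("frame of %d bytes does not fit at offset %d", total, ptr)
	}
	binary.LittleEndian.PutUint32(mem[start:], uint32(len(payload)))
	copy(mem[start+4:], payload)
	return uint32(total), nil
}

// readFrame reads a length-prefixed frame at offset ptr and returns a copy
// of its payload. The copy is the extra allocation the comment above expects.
func readFrame(mem []byte, ptr uint32) ([]byte, error) {
	start := uint64(ptr)
	if start+4 > uint64(len(mem)) {
		return nil, fmt.Errorf("length prefix at offset %d is out of bounds", ptr)
	}
	n := uint64(binary.LittleEndian.Uint32(mem[start:]))
	end := start + 4 + n
	if end > uint64(len(mem)) {
		return nil, fmt.Errorf("payload of %d bytes at offset %d is out of bounds", n, ptr)
	}
	out := make([]byte, n)
	copy(out, mem[start+4:end])
	return out, nil
}

func main() {
	mem := make([]byte, 64) // stand-in for the guest's linear memory
	n, err := writeFrame(mem, 8, []byte("hello"))
	if err != nil {
		panic(err)
	}
	ptr, size := unpackPtrLen(packPtrLen(8, n))
	data, err := readFrame(mem, ptr)
	if err != nil {
		panic(err)
	}
	fmt.Println(ptr, size, string(data))
}
```

Each `readFrame` call allocates a fresh slice, so with this kind of framing every host/guest round trip costs at least one allocation proportional to the payload size on the receiving side.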
Can you try the following:
You can write a custom allocator like this (untested):
// NOTE: //export may be removed in the future but is likely going to be faster for now
var buffers = map[*byte]struct{}
//export bufalloc
func bufalloc(size uintptr) *byte {
s := make([]byte, size)
buffers[&s[0]] = struct{}
return &s[0]
}
//export buffree
func buffree(ptr *byte) {
// doesn't actually free, will cause a GC cycle eventually (same as the exported `free` function actually)
delete(buffers, ptr)
}
Of course it will lead to memory corruption if you actually store pointers to other heap-allocated objects in there, but for non-pointer data it should be a big GC speedup. |
Unfortunately it shows similar results. I ran the benchmark using the suggested implementation with a minor fix:
var buffers = map[*byte]struct{}{}
//export bufalloc
func bufalloc(size uintptr) *byte {
s := make([]byte, size)
buffers[&s[0]] = struct{}{}
return &s[0]
}
//export buffree
func buffree(ptr *byte) {
delete(buffers, ptr)
}
Results:
Actually, I had tested a similar approach before with the same results, then checked how malloc/free were implemented, and it was essentially the same as what I was testing, so I didn't mention it here. Should it produce different results from the default implementation in tinygo/src/runtime/arch_tinygowasm_malloc.go (lines 13 to 34 in 2a76ceb)?
|
Another issue is that running that benchmark for a longer time, e.g. with
In the code I've posted earlier the
start := instance.GetExport(store, "_start").Func()
_, err = start.Call(store)
if err != nil {
panic(err)
}
But this doesn't help with the out of memory error. Is it because GC doesn't initiate garbage collection for some reason?
Is it related to this issue? I wonder if the poor performance here might be at least partially caused by the fact that the GC doesn't collect garbage in time and that causes the linear memory to grow many times with reallocations on the wasm runtime side. |
I'm investigating the comparatively poor performance of a Wasm guest compiled using TinyGo when it is executed in Wasmtime embedded into a Go service (compiled using the official Go compiler) when there is any non-trivial (megabytes) data exchange between the host and the guest. When trying to make the smallest possible example that would still have these performance issues, I found that a test that just calls the functions malloc and free, which are exported by default, from the host already exhibits unexpectedly poor performance.
Here is the test:
I've run this benchmark with these guests:
tinygo build -o $(WASM)/tinygo.wasm -target=wasi
--initial-memory=209715200
Here are the results on my machine:
As you can see, the TinyGo-compiled guest with the built-in allocator seems to slow down significantly when the allocated memory size increases. It takes almost 5 milliseconds to allocate 1 MB and 0.3-0.4 seconds for 200 MB, which seems slow by any standard. This is orders of magnitude slower than what I get with the custom allocator/GC or in a guest compiled with AssemblyScript.
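For a rough sense of scale, the timings quoted above can be turned into allocation throughput. The sketch below only does that arithmetic on the rounded figures from this post (5 ms for 1 MB, 0.3-0.4 s for 200 MB); the helper name `throughputMBps` is made up for illustration, and this is not the benchmark itself.

```go
package main

import "fmt"

// throughputMBps converts "sizeMB megabytes allocated in the given number
// of seconds" into megabytes per second.
func throughputMBps(sizeMB, seconds float64) float64 {
	return sizeMB / seconds
}

func main() {
	// Rounded figures taken from the post, not fresh measurements.
	fmt.Printf("1 MB in 5 ms:     %.0f MB/s\n", throughputMBps(1, 0.005))
	fmt.Printf("200 MB in 0.3 s:  %.0f MB/s\n", throughputMBps(200, 0.3))
	fmt.Printf("200 MB in 0.4 s:  %.0f MB/s\n", throughputMBps(200, 0.4))
}
```

By this estimate the built-in allocator hands out memory at a few hundred MB/s in these runs.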
Is this a known issue or am I doing something wrong? Maybe there is some compilation flag that could improve the performance?
https://github.com/wasilibs/nottinygc is archived, so although I've included it in the benchmarks, it is not a viable option.