
Slow heap allocations in a Wasm runtime #4592

Open

epsylonix opened this issue Nov 11, 2024 · 5 comments

@epsylonix

epsylonix commented Nov 11, 2024

I'm investigating the comparatively poor performance of a Wasm guest compiled with TinyGo when it runs in Wasmtime embedded in a Go service (built with the official Go compiler) and there is any non-trivial (megabytes) data exchange between the host and the guest. While reducing this to the smallest possible example that still shows the problem, I found that a test that merely calls the default-exported malloc and free functions from the host already exhibits unexpectedly poor performance.

Here is the test:

import (
	"strconv"
	"testing"

	"github.com/bytecodealliance/wasmtime-go/v25" // adjust to the major version installed in your module
)

var allocationSizes = []int{100, 1024, 1024 * 100, 1024 * 1024, 200 * 1024 * 1024}

func BenchmarkMemAllocation(b *testing.B) {
	wasmPath := "./wasm/tinygo.wasm"

	wasmtimeConfig := wasmtime.NewConfig()
	defer wasmtimeConfig.Close()

	engine := wasmtime.NewEngineWithConfig(wasmtimeConfig)
	defer engine.Close()

	wasiConfig := wasmtime.NewWasiConfig()
	wasiConfig.InheritStdout()
	wasiConfig.InheritStderr()
	defer wasiConfig.Close()

	store := wasmtime.NewStore(engine)
	store.SetWasi(wasiConfig)
	defer store.Close()

	linker := wasmtime.NewLinker(engine)
	defer linker.Close()

	err := linker.DefineWasi()
	if err != nil {
		panic(err)
	}

	module, err := wasmtime.NewModuleFromFile(engine, wasmPath)
	if err != nil {
		panic(err)
	}

	instance, err := linker.Instantiate(store, module)
	if err != nil {
		panic(err)
	}

	malloc := instance.GetExport(store, "malloc").Func()
	free := instance.GetExport(store, "free").Func()

	for _, allocationSize := range allocationSizes {
		label := "size_" + strconv.Itoa(allocationSize)

		b.Run(label, func(b *testing.B) {
			for range b.N {
				ptr, err := malloc.Call(store, allocationSize)
				if err != nil {
					panic(err)
				}
				_, err = free.Call(store, ptr)
				if err != nil {
					panic(err)
				}
			}
		})
	}
}

I've run this benchmark with these guests:

  1. Compiled with tinygo build -o $(WASM)/tinygo.wasm -target=wasi (a minimal guest sketch follows after this list)
  2. Same but compiled with increased initial memory --initial-memory=209715200
  3. Compiled with a custom gc/allocator (https://github.com/wasilibs/nottinygc)
  4. An AssemblyScript guest (requires changing the names of the malloc/free functions in the test)
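
For reference, the TinyGo guest in case 1 can be as minimal as the sketch below; the actual guest source isn't included in this issue, but the malloc/free exports come from the TinyGo runtime itself (arch_tinygowasm_malloc.go), so no user code is required:

// Minimal TinyGo guest for this benchmark: malloc and free are exported
// by the TinyGo runtime, so an empty program is enough.
package main

func main() {}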

Here are the results on my machine:

go test -bench='BenchmarkMemAllocation' ./...

# TinyGo
BenchmarkMemAllocation/size_100                603967              2223 ns/op
BenchmarkMemAllocation/size_1024               383356              3362 ns/op
BenchmarkMemAllocation/size_102400               4455            257539 ns/op
BenchmarkMemAllocation/size_1048576               276           4267167 ns/op
BenchmarkMemAllocation/size_209715200               9         300325597 ns/op

# TinyGo with initial memory set to 200 MB
BenchmarkMemAllocation/size_100-12             630244              2169 ns/op
BenchmarkMemAllocation/size_1024-12            510651              3135 ns/op
BenchmarkMemAllocation/size_102400-12           10000            111715 ns/op
BenchmarkMemAllocation/size_1048576-12           4251           1365305 ns/op
BenchmarkMemAllocation/size_209715200-12            4         484141479 ns/op

# TinyGo with custom GC
BenchmarkMemAllocation/size_100-12             706581              1940 ns/op
BenchmarkMemAllocation/size_1024-12            693205              1904 ns/op
BenchmarkMemAllocation/size_102400-12          555777              2392 ns/op
BenchmarkMemAllocation/size_1048576-12         530181              2396 ns/op
BenchmarkMemAllocation/size_209715200-12       559406              2338 ns/op

# AssemblyScript
BenchmarkMemAllocation/size_100-12             702817              1967 ns/op
BenchmarkMemAllocation/size_1024-12            618452              2024 ns/op
BenchmarkMemAllocation/size_102400-12          611690              1986 ns/op
BenchmarkMemAllocation/size_1048576-12         620326              2046 ns/op
BenchmarkMemAllocation/size_209715200-12       636948              2044 ns/op

As you can see, the TinyGo-compiled guest with the built-in allocator slows down significantly as the allocation size increases. It takes almost 5 milliseconds to allocate 1 MB and 0.3-0.4 seconds for 200 MB, which seems slow by any standard. This is orders of magnitude slower than what I get with the custom allocator/GC or with the AssemblyScript guest.

Is this a known issue or am I doing something wrong? Maybe there is some compilation flag that could improve the performance?

https://github.com/wasilibs/nottinygc is archived, so while I've included it in the benchmarks, it is not a viable option.

@aykevl
Member

aykevl commented Nov 12, 2024

Unfortunately C.malloc and C.free are somewhat of a worst case: the GC can't assume anything about them and will always scan them even though they might be entirely pointer-free.
I recently switched the heap to use the (mostly) precise GC instead of the entirely conservative one, which means that large allocations that don't contain pointers (such as strings and byte slices) do not need to be scanned by the GC and therefore won't make the GC much slower.
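
For example, the distinction looks roughly like this (an illustration, assuming TinyGo's precise GC):

package main

func main() {
	// Pointer-free allocation: under the (mostly) precise GC this block
	// never needs to be scanned during a collection.
	data := make([]byte, 1<<20)

	// Allocation that may contain pointers: the GC must scan every slot on
	// each cycle. Memory returned by the exported malloc is treated like
	// this, because the GC can't assume anything about its contents.
	nodes := make([]*int, 1<<20)

	_, _ = data, nodes
}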

To give a better idea of why it is slow and how this can be fixed, it would help if you could give some information on the specific workload. Such as:

  • number of goroutines
  • heap size
  • how much of that is taken up by pointer-free objects (such as strings, byte slices, []int, etc).

Also see: #3899 and #4550

@epsylonix
Author

Thank you for looking into it.

I recently switched the heap to use the (mostly) precise GC instead of the entirely conservative one, which means that large allocations that don't contain pointers (such as strings and byte slices) do not need to be scanned by the GC and therefore won't make the GC much slower.

Forgot to mention that I ran these benchmarks with TinyGo v0.33.0; the CPU is an Apple M3 Pro. As far as I can tell from the changelog, the precise GC was introduced in v0.34.0. Rerunning the benchmark with v0.34.0 shows these results:

BenchmarkMemAllocation/size_100-12                590713              2133 ns/op
BenchmarkMemAllocation/size_1024-12               353888              3431 ns/op
BenchmarkMemAllocation/size_102400-12               4881            244366 ns/op
BenchmarkMemAllocation/size_1048576-12               324           3730553 ns/op
BenchmarkMemAllocation/size_209715200-12              10         251992683 ns/op

With v0.33.0 it was

BenchmarkMemAllocation/size_100                603967              2223 ns/op
BenchmarkMemAllocation/size_1024               383356              3362 ns/op
BenchmarkMemAllocation/size_102400               4455            257539 ns/op
BenchmarkMemAllocation/size_1048576               276           4267167 ns/op
BenchmarkMemAllocation/size_209715200               9         300325597 ns/op

It seems to be in the same ballpark and still significantly slower than the results I get with the custom GC (https://github.com/wasilibs/nottinygc) for example, or with the AssemblyScript version.

I am new to Wasm so it is possible I'm missing some configuration option or making an obvious mistake. Are these numbers what you would expect or do they look wrong?

To give a better idea of why it is slow and how this can be fixed, it would help if you could give some information on the specific workload

I'm currently evaluating if WebAssembly would meet my requirements and the way I plan to use it might change because what I have in mind might not be the best way to work with Wasm.

I would like to have several Wasm guests that can interact with each other through the host. A Wasm guest would be able to call a host-exported function to make an http request for example and then call other host-exported functions while processing the response (which can be quite large).

Component model support is currently only available in Wasmtime when it is embedded in Rust, so my plan was to implement the data exchange between the host and a guest through the linear memory for now. This means the data has to be serialized and deserialized, and zero-copy would likely be difficult to implement. Because of that I expect a number of heap allocations in the process.

If several interactions between the host and a guest each involve several megabytes of data, the overhead for memory allocations alone could be tens of milliseconds, and if the data reaches a hundred MB it would probably grow to seconds, if these benchmarks are accurate.

I expect a guest module to have just a handful of goroutines; the heap size would be up to 2 GB, mostly taken up by structs or maps whose values contain relatively large []byte slices or strings.
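
To make this concrete, the host-to-guest exchange I have in mind looks roughly like the sketch below (using wasmtime-go; copyToGuest is a hypothetical helper and error handling is minimal):

// copyToGuest reserves guest memory through the exported malloc and copies
// the payload into the guest's linear memory, returning the guest pointer.
func copyToGuest(store *wasmtime.Store, instance *wasmtime.Instance, data []byte) (int32, error) {
	malloc := instance.GetExport(store, "malloc").Func()
	ret, err := malloc.Call(store, len(data))
	if err != nil {
		return 0, err
	}
	ptr := ret.(int32)

	// The guest's linear memory is exposed to the host as a byte slice.
	mem := instance.GetExport(store, "memory").Memory()
	copy(mem.UnsafeData(store)[int(ptr):int(ptr)+len(data)], data)
	return ptr, nil
}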

@aykevl
Member

aykevl commented Nov 13, 2024

Can you try the following:

  • Use TinyGo 0.34.0
  • Use -opt=2 to optimize for performance over code size
  • Use a custom allocator that allocates byte slices instead of the exported malloc and free functions that can't be optimized much

You can write a custom allocator like this (untested):

// NOTE: //export may be removed in the future but is likely going to be faster for now

var buffers = map[*byte]struct{}

//export bufalloc
func bufalloc(size uintptr) *byte {
    s := make([]byte, size)
    buffers[&s[0]] = struct{}
    return &s[0]
}

//export buffree
func buffree(ptr *byte) {
    // doesn't actually free, will cause a GC cycle eventually (same as the exported `free` function actually)
    delete(buffers, ptr)
}

Of course it will lead to memory corruption if you actually store pointers to other heap-allocated objects in there, but for non-pointer data it should be a big GC speedup.

@epsylonix
Author

epsylonix commented Nov 13, 2024

Unfortunately it shows similar results. I ran the benchmark using the suggested implementation with a minor fix:

var buffers = map[*byte]struct{}{}

//export bufalloc
func bufalloc(size uintptr) *byte {
	s := make([]byte, size)
	buffers[&s[0]] = struct{}{}
	return &s[0]
}

//export buffree
func buffree(ptr *byte) {
	delete(buffers, ptr)
}

Results:

BenchmarkMemAllocation/size_100                   579721              2665 ns/op
BenchmarkMemAllocation/size_1024                  315070              4192 ns/op
BenchmarkMemAllocation/size_102400                  4911            237693 ns/op
BenchmarkMemAllocation/size_1048576                  336           3681836 ns/op
BenchmarkMemAllocation/size_209715200                  9         301891597 ns/op

I had actually tested a similar approach before, with the same results; I then checked how malloc/free are implemented and found it is essentially the same as what I was testing, so I didn't mention it here.

Should it produce different results from the default implementation in

//export malloc
func libc_malloc(size uintptr) unsafe.Pointer {
	if size == 0 {
		return nil
	}
	buf := make([]byte, size)
	ptr := unsafe.Pointer(&buf[0])
	allocs[uintptr(ptr)] = buf
	return ptr
}

//export free
func libc_free(ptr unsafe.Pointer) {
	if ptr == nil {
		return
	}
	if _, ok := allocs[uintptr(ptr)]; ok {
		delete(allocs, uintptr(ptr))
	} else {
		panic("free: invalid pointer")
	}
}
?

@epsylonix
Author

epsylonix commented Nov 13, 2024

Another issue is that running the benchmark for a longer time, e.g. with -benchtime=10s, results in an out-of-memory crash:

WASMTIME_BACKTRACE_DETAILS=1 GOMAXPROCS=1 go test -benchtime=10s -bench='BenchmarkMemAllocation' ./...

BenchmarkMemAllocation/size_100                  4367896              2912 ns/op
BenchmarkMemAllocation/size_1024                 3280836              3726 ns/op
BenchmarkMemAllocation/size_102400                 71882            167057 ns/op
BenchmarkMemAllocation/size_1048576                 3390           3164666 ns/op
BenchmarkMemAllocation/size_209715200           panic: runtime error: out of memory
       0                       NaN ns/op
panic: error while executing at wasm backtrace:
            0: 0x150f - runtime.abort
                            at /opt/homebrew/Cellar/tinygo/0.34.0/src/runtime/runtime_tinygowasm.go:78:6
                          - runtime.runtimePanicAt
                            at /opt/homebrew/Cellar/tinygo/0.34.0/src/runtime/panic.go:90:7
            1: 0x1de8 - runtime.alloc
                            at /opt/homebrew/Cellar/tinygo/0.34.0/src/runtime/gc_blocks.go:324:20
            2: 0x1927 - malloc
                            at /opt/homebrew/Cellar/tinygo/0.34.0/src/runtime/arch_tinygowasm_malloc.go:18:13

        Caused by:
            wasm trap: wasm `unreachable` instruction executed

In the code I posted earlier, the _start call was missing, so I've updated it with:

start := instance.GetExport(store, "_start").Func()
_, err = start.Call(store)
if err != nil {
  panic(err)
}

But this doesn't help with the out-of-memory error. Is it because the GC doesn't initiate garbage collection for some reason?
I've noticed this note in the docs (https://tinygo.org/docs/reference/lang-support/):

Garbage collection generally works fine, but may work not as well on very small chips (AVR) and on WebAssembly

Is it related to this issue?

I wonder if the poor performance here might be at least partially caused by the GC not collecting garbage in time, which would make the linear memory grow many times, with reallocations on the Wasm runtime side.
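
One way I could test that hypothesis is to export a function that forces a collection and call it from the host between iterations (a sketch; forcegc is a name I made up):

package main

import "runtime"

// forcegc triggers a full collection on demand, so the host can run it
// between benchmark iterations to rule out late collections.
//
//export forcegc
func forcegc() {
	runtime.GC()
}

func main() {}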
