
[WIP, RFC] Problem: heavy use of dynamic memory allocation, leading to performance degradation and non-determinism #3911

Draft · 58 commits · base: master

Conversation

mjvankampen (Contributor) commented May 12, 2020

See #3644

Basically the problem we are trying to solve is that the default allocator (malloc/free) is:

  • slow
  • unpredictable

@f18m showed that performance can be increased significantly by re-using pre-allocated memory. My own use case is somewhat different: I prefer to have as little dynamic memory allocation in my applications as possible, since determinism matters to me and allocation on the heap is a source of non-determinism.

I took @f18m's pull request, tried to make it a bit nicer by allowing users to provide their own queues, and made the queue grow dynamically.

Open discussion points:

  • Do we need to allow the user to provide an allocator, or is the zero copy feature enough?
  • SPSC queue: remove it, I guess; @somdoron is convinced it will not work, and @f18m tends to agree (I don't have the insight to judge this)
  • <C++11 implementation: std::queue + mutex, or hide the global pool behind ifdefs? I prefer a (slower) <C++11 implementation
  • External integration: is it ok to integrate concurrentqueue.h like this? Writing a custom queue for zmq seems too big an effort, but bundling it does not align with C4 2.3.5.
  • Naming: ZMQ_MSG_ALLOCATOR_GLOBAL_POOL or ZMQ_ALLOCATOR_GLOBAL_POOL. Is it used anywhere else outside the msg api?
  • What level of testing is sufficient: I would add another test with a larger size
  • Add another message type for allocator: is this required or is the current setup fine
  • Add allocator to context: so that other msg_init are done using the custom allocator. Is this a feature that is required/possible (at this stage)?
  • Any other suggestions? Such as required features?
  • Is the current dynamic resize ok: it adds a bit of overhead vs the compile time bins.

Todo

  • Add RFC for better PR description
  • Update documentation
  • Add statistics from memory pool
  • Fix CI: broken build on some platforms
  • Add unit tests: coverage below par
  • Change to function pointers, virtuals cannot be exposed through zmq.h
  • Fix missing headers (something with pch)
  • Add free_fn for allocator

RFC

Problem

ZMQ uses quite a bit of dynamic memory allocation. This results in

  • some loss in performance as heap allocation takes time, can introduce implicit synchronisation and lead to heap fragmentation, and
  • non-determinism, as most new/malloc implementations do not have an upper-bound time limit.

The first is not a problem in itself, but as one of ZMQ's strengths is performance, it is always good to improve this. The second is an issue in systems that require a bound on latency.

Solution direction

The simplest solution would be to allow users to provide a custom allocator that is used instead of the default one. This has the advantage that users can tailor the application to their needs and solve the above issues, selecting the trade-off between memory vs. speed vs. determinism themselves. Of course it adds some disadvantages as well:

  • some overhead is likely to be introduced (or some templating);
  • a new API is required; and
  • this allows users to break ZMQ with their custom allocator, resulting in more questions for the ZMQ community.

To mitigate these somewhat a standard implementation for an allocator should be provided.

Default allocator usage

There are quite a few allocation calls in zmq: approximately 100 new and 50 malloc calls. Replacing all of these is quite a bit of work and most likely unnecessary, as most of them happen during initialisation. Focus should be on the allocations that happen often; the best place to look is the code that performs an allocation for every message sent or received. This points to allocations in:

  • msg.hpp/cpp, and
  • decoders (decoder_allocators.hpp/cpp).

At the moment (correct me if I'm wrong) the decoder does not allocate per message but per stream. It seems the most gain is in the messages, which makes the decoder a nice-to-have (at this point).

APIs

  • message level API: set an allocator per message.

Another option is to set an allocator for all messages (using the context). In my opinion this is a nice to have, and should only be implemented if someone asks for it.

zmq_allocator_new(type)

Construct an allocator of the given type. This is a way to provide default allocators to users; if they provide their own custom allocator, this function is not used.

zmq_allocator_destroy(allocator)

Destroy an allocator (either user-provided or native) after you are done with it. It can only be called once all messages have been handled, i.e. after terminating the context.

zmq_msg_init_allocator(message, size, allocator)

Initialize a message using a custom allocator. Very similar to zmq_msg_init_size except with an argument to pass an allocator.

zmq_allocator_t

A struct consisting of function pointers.

void*(*allocate_fn) (void *allocator, size_t len)
Allocate function

void(*deallocate_fn) (void *allocator, void *data_)
Deallocate function

bool(*check_tag_fn) (void *allocator)
Check the tag of an allocator to see if the allocator pointer is valid

void *allocator
Pointer to an allocator itself (on which to use the above functions)

optional (not required for the msg interface):
void(*release_fn) (void *allocator, void *data_)
Release data; this means the allocator is no longer the owner.
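To make the struct above concrete, here is a self-contained sketch of the function-pointer table plus a trivial malloc/free-backed implementation with a liveness tag. The names mirror the RFC text, but the struct layout, tag value, and helper names are assumptions for illustration; the actual header may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical mirror of the proposed zmq_allocator_t: plain function
// pointers plus an opaque pointer to the allocator instance.
struct zmq_allocator_t {
    void *(*allocate_fn) (void *allocator, size_t len);
    void (*deallocate_fn) (void *allocator, void *data_);
    bool (*check_tag_fn) (void *allocator);
    void *allocator;
};

// A minimal default allocator; the tag lets check_tag_fn verify that an
// opaque void* really points at a live allocator.
struct default_allocator {
    uint32_t tag = 0xCAFE0001;
};

static void *def_alloc (void *self_, size_t len) {
    (void) self_;
    return malloc (len);
}
static void def_dealloc (void *self_, void *data_) {
    (void) self_;
    free (data_);
}
static bool def_check_tag (void *self_) {
    return static_cast<default_allocator *> (self_)->tag == 0xCAFE0001;
}

// Assemble the function-pointer table for a given allocator instance.
zmq_allocator_t make_default_allocator (default_allocator *impl) {
    return zmq_allocator_t{def_alloc, def_dealloc, def_check_tag, impl};
}
```

The message code would then only ever call through the table, so a user-supplied struct with the same shape can be dropped in transparently.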

Provided allocators

Default

A basic allocator using new/delete. Basically the same as what we have now, but with the slight added cost of indirection through a vtable (which might be eliminated using an interface and marking things as final).

Global pool

This is an allocator which keeps queues of memory chunks organised in bins by size. This reduces the amount of dynamic memory allocation (thus gaining speed) at the cost of reserving more memory than strictly needed. To prevent an unreasonable startup footprint, memory is allocated on demand in powers of two: if more chunks are required than are available in a bin, the number of chunks in that bin is doubled, and if a request exceeds the biggest bin, a bin twice its size is created.

Depending on the workload this can lead to excessive memory use: scenarios with large bursts of a single message size, but changing message sizes between bursts, might leave a lot of memory reserved. If we want to prevent this, it might be wise to set a limit on the pre-allocated memory and fall back on new/delete (or return an error) when it is exceeded. Or we trust the user not to pick this specific allocator.
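A toy sketch of that growth policy (illustrative only; the 256-byte minimum bin and the doubling-from-one capacity rule are assumed defaults, not taken from the PR's code):

```cpp
#include <cstddef>

// Toy model of the global pool's sizing rules: bin sizes are powers of
// two, and an exhausted bin doubles its chunk capacity.
class toy_global_pool {
  public:
    static const size_t min_bin = 256; // assumed smallest bin size

    // Round a requested length up to the containing power-of-two bin.
    static size_t bin_size_for (size_t len) {
        size_t size = min_bin;
        while (size < len)
            size *= 2;
        return size;
    }

    // Capacity of a bin after it runs out of chunks: doubled each time.
    static size_t grown_capacity (size_t current) {
        return current == 0 ? 1 : current * 2;
    }
};
```

This shows why bursty workloads with shifting message sizes are the worst case: each new size that exceeds the current biggest bin permanently doubles the largest chunk size the pool keeps around.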

Other?

f18m and others added 30 commits August 13, 2019 11:19
In practice I observed that 256-byte messages require roughly 1024 pre-allocated messages to reach max performance on my PC. This scales with message size, I believe.
ZMQ_EXPORT int
zmq_msg_init_allocator (zmq_msg_t *msg_, size_t size_, void *allocator_);

struct zmq_allocator_t
Member:

Does this need to be exposed? If so, can the type not be used in the API, as in ZMQ_EXPORT zmq_allocator_t *zmq_msg_allocator_new (int type_);?

mjvankampen (Contributor Author):

If I don't expose this one, users cannot provide their own allocator with the current API. This is a design decision. I would allow users to provide their own allocators (maybe they want fixed bins with new, fixed bins with errors, an off-the-shelf allocator, etc.). But of course that depends on how much you want to expose to the user. What do you think about this?

void(*deallocate_fn) (void *allocator, void *data_);

// Return true if this is an allocator and alive, otherwise false
bool(*check_tag_fn) (void *allocator);
gummif (Member) commented May 17, 2020:

Is this required? In what circumstances will this be useful?

mjvankampen (Contributor Author):

When the user wants to provide their own allocator instead of using one of the built-ins.

Member:

What I mean is specifically the check_tag_fn function. I don't see why it is necessary at all. Is it used for trying to catch programming errors at runtime?

mjvankampen (Contributor Author):

It would allow you to check if a void pointer is a pointer to what you think it is. It was in the previous PR so I left it in. Need to check if it is actually used somewhere.

return success;
}

size_t size_approx () const { return _queue.size (); }
Member:

This is not thread-safe.

mjvankampen (Contributor Author):

Hence the size_approx() instead of size(). It might be tricky though; I'm not sure what size() returns during a concurrent change. I'll see if I can find it in the standard and otherwise add a mutex.

Reviewer:

Depending on the architecture, you could get a torn read. Although since it is size_t, it should be the platform's native word size and therefore safe-ish.
If size_approx needs to be fast, you could alternatively use a std::atomic size counter with relaxed memory ordering.
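The relaxed-atomic suggestion could look like this (an illustrative sketch with hypothetical names, not the PR's queue code):

```cpp
#include <atomic>
#include <cstddef>

// Track the queue length in a separate relaxed std::atomic counter so
// size_approx() never performs a potentially torn read of a container's
// internal size while another thread is mutating it.
class approx_counter {
    std::atomic<size_t> _size{0};

  public:
    void on_enqueue () { _size.fetch_add (1, std::memory_order_relaxed); }
    void on_dequeue () { _size.fetch_sub (1, std::memory_order_relaxed); }

    // The value may be slightly stale under contention, but it is always
    // a value the counter actually held at some point, never torn.
    size_t size_approx () const {
        return _size.load (std::memory_order_relaxed);
    }
};
```

Relaxed ordering is enough here because the counter only advises sizing decisions; it does not synchronise access to the queue's contents.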

_storage_mutex.lock ();
size_t oldSize = _storage.size ();
if (oldSize <= bl) {
_storage.resize (bl + 1);
Member:

While this function locks a mutex, these member variables are accessed in other functions without locks, which is not safe.

mjvankampen (Contributor Author):

Thanks! Changed a few things around. Should be better now. Still need to check thoroughly.


private:
// Used to check whether the object is a socket.
uint32_t _tag;

Reviewer:

I see _tag is marked as dead in destructor, but the compiler will optimize that out unless _tag is volatile-qualified (which could have performance implications elsewhere).

header->capacity = nextBlockIndexCapacity;
header->tail.store((prevCapacity - 1) & (nextBlockIndexCapacity - 1), std::memory_order_relaxed);

blockIndex.store(header, std::memory_order_release);
Static analysis:

MEMORY_LEAK: memory dynamically allocated at line 2931 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 2963, column 4.

if (!raw)
return nullptr;
char* ptr = details::align_for<TAlign>(reinterpret_cast<char*>(raw) + sizeof(void*));
*(reinterpret_cast<void**>(ptr) - 1) = raw;
Static analysis:

MEMORY_LEAK: memory dynamically allocated at line 3556 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3560, column 3.

// If it's < three-quarters full, add to the old one anyway so that we don't have to wait for the next table
// to finish being allocated by another thread (and if we just finished allocating above, the condition will
// always be true)
if (newCount < (mainHash->capacity >> 1) + (mainHash->capacity >> 2)) {
Static analysis:

MEMORY_LEAK: memory dynamically allocated at line 3430 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3458, column 8.

axelriet (Contributor):
I've looked at this as I, too, believe it's worth doing something: in some scenarios (inproc, and presumably other very fast transports) the dominant location where time is spent is the heap. However, I think the approach taken here is too complex. All that is needed is a couple of strategically placed hooks for malloc/free, a couple of typedefs, and one public function to pass in two function pointers; then plug in a faster heap. I did just that and used Intel's TBB scalable allocator instead of the CRT heap, and the speed gains are in the 20% range below the VSM threshold (33 bytes of message payload or less), ranging from 30% to 70% depending on message size after that. In fact it works so well that I later made using TBB a build option in itself, so TBB can be the new default, or you can supply custom functions in case you think you can do better.

f18m (Contributor) commented Dec 21, 2023:


hi @axelriet, do you have any "PoC-quality code" around your TBB-based approach that you can share?
I would be interested in having a look... (I originally raised #3644 but then exhausted the time I had available to complete the feature... as you said, maybe it was too complex an approach after all)

axelriet (Contributor) commented Dec 21, 2023:

@f18m Yep you can check my fork of the project. It’s a bit of a battlefield atm, but it builds on Windows, all tests pass, and it’s a drop-in binary replacement so it should be easy to experiment with. The alloc hooks and defaulting to TBB are two separate things that you can combine or not. If you want to try Intel’s allocator just get the latest oneApi from their website. They don’t supply a static lib (should be possible to build one, though) so you’ll have to add tbbmalloc.dll to your binaries. ETA: I've TBB-ified all the low-level constructs now (tries, queues...) The custom alloc and TBB alloc work is spread amongst several commits, the main ones being this one and that one. I think I can claim a solid 30% increase in throughput using inproc_thr.exe but I expect the benefits to be even greater on real-world server workloads with lots of competition inside and outside the process. There is one new public function to support the feature:

ZMQ_EXPORT (bool)
zmq_set_custom_msg_allocator (_In_ zmq_custom_msg_alloc_fn *malloc_,
                              _In_ zmq_custom_msg_free_fn *free_);

If never called, the default is either malloc/free or TBB's depending on build options.
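As a standalone illustration of this two-function-pointer hook mechanism (the typedefs, default wiring, and internal helpers below are a sketch inferred from the description above, not the fork's actual source):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook typedefs mirroring the public function shown above. */
typedef void *(zmq_custom_msg_alloc_fn) (size_t len);
typedef void (zmq_custom_msg_free_fn) (void *data);

/* Default to the CRT heap; a build option could instead default these to
 * TBB's scalable_malloc/scalable_free. */
static zmq_custom_msg_alloc_fn *s_msg_malloc = malloc;
static zmq_custom_msg_free_fn *s_msg_free = free;

/* The one public registration function: swap in a faster heap. */
bool zmq_set_custom_msg_allocator (zmq_custom_msg_alloc_fn *malloc_,
                                   zmq_custom_msg_free_fn *free_)
{
    if (!malloc_ || !free_)
        return false;
    s_msg_malloc = malloc_;
    s_msg_free = free_;
    return true;
}

/* Inside the library, message buffers would route through the hooks. */
void *msg_data_alloc (size_t len) { return s_msg_malloc (len); }
void msg_data_free (void *data) { s_msg_free (data); }
```

Compared with the per-message allocator struct in this PR, the hooks are process-global: no new message-init API and no per-message pointer, at the cost of not being able to mix allocators within one process.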

As a side note I've added support for Microsoft Hyper-V's HvSocket transport for guest-host communication (which incidentally is now the fastest zmq IPC on Windows) as well as Vsock for Linux VMs to talk to their host.

axelriet (Contributor) commented Jan 2, 2024:

Benefits of a custom allocator for inproc workloads. Here I'm using Intel's Scalable Allocator, part of Intel oneAPI / Threading Building Blocks. There is definitely an advantage. One thing jumps out: the VSM optimization is very effective. Too bad it's limited to 33 bytes or less. To increase that number, one could reduce the threshold for long group names from 14 + z down to 10 + z chars and raise the VSM threshold from <= 33 bytes to <= 37 bytes, without changing the existing zmq_msg_t size and therefore without breaking binary compatibility with existing clients. That's a modest improvement, but it lets more small messages stay below the VSM break, and it's easy to restrict group names to a smaller length (to avoid an external allocation) by convention.

[image: inproc benchmark results with and without the custom allocator]

axelriet (Contributor) commented Jan 3, 2024:

I went ahead in my fork and was able to increase the VSM threshold to 40 bytes (up from 33) by repacking the group_t and msg_t structs and reducing short group names to 7 chars.

Therefore, group names of 8 chars and up now switch to long group names (vs 15 chars and up previously), but the fast VSM path applies to messages of up to 40 bytes, which may be important to some users.

I think 40/7 is a better compromise than the original 33/14 choice, as users can trivially shrink group names to 7 chars or less if they want max performance for small messages, but they could not increase the VSM threshold beyond 33 if their data didn't fit under it. Now they have seven more bytes to play with. I get about 9M msg/sec in inproc_thr for messages up to 40 bytes now.
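The repacking arithmetic can be checked with a small back-of-the-envelope sketch. The field sizes below are assumptions based on libzmq's 64-byte zmq_msg_t (metadata pointer, VSM payload, size/type/flags bytes, routing id, and inline group storage); the fork's exact packing may differ.

```c
#include <assert.h>

enum {
    msg_t_size = 64,     /* total zmq_msg_t budget */
    metadata_ptr = 8,    /* pointer to metadata */
    size_type_flags = 3, /* one byte each */
    routing_id = 4,      /* uint32_t */

    /* original: 33-byte VSM payload; short group = 1 type byte
     * + 14 chars + NUL */
    old_vsm = 33,
    old_group = 1 + 14 + 1,

    /* repacked: 40-byte VSM payload; short group = 1 type byte
     * + 7 chars + NUL */
    new_vsm = 40,
    new_group = 1 + 7 + 1
};

/* Both layouts account for exactly the same 64 bytes, which is why the
 * change stays binary-compatible in struct size. */
_Static_assert (metadata_ptr + old_vsm + size_type_flags + routing_id
                  + old_group == msg_t_size,
                "original 33/14 layout fills 64 bytes");
_Static_assert (metadata_ptr + new_vsm + size_type_flags + routing_id
                  + new_group == msg_t_size,
                "repacked 40/7 layout fills 64 bytes");
```

In other words, the 7 bytes of extra VSM payload come directly from the 7 chars removed from the inline group-name storage.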
