[WIP, RFC] Problem: heavy use of dynamic memory allocation, leading to performance degradation and non-determinism #3911
base: master
Conversation
by completely removing malloc/free from the hot path of benchmark util
…pool # Conflicts: # src/msg.cpp # src/msg.hpp # src/zmq.cpp
In practice I observed that 256-byte messages require roughly 1024 pre-allocated messages to reach maximum performance on my PC. I believe this scales with message size.
Changed some options that should not have changed
Cpp equivalent of malloc/free
ZMQ_EXPORT int
zmq_msg_init_allocator (zmq_msg_t *msg_, size_t size_, void *allocator_);

struct zmq_allocator_t
Does this need to be exposed? If so, can the type not be used in the API, as in ZMQ_EXPORT zmq_allocator_t *zmq_msg_allocator_new (int type_);?
If I don't expose this one, users cannot provide their own allocator with the current API. This is a design decision: I would allow users to provide their own allocators (maybe they want fixed bins backed by new, fixed bins that fail with an error, an off-the-shelf allocator, etc.). But of course that depends on how much you want to expose to the user. What do you think about this?
void(*deallocate_fn) (void *allocator, void *data_);

// Return true if this is an allocator and alive, otherwise false
bool(*check_tag_fn) (void *allocator);
Is this required? In what circumstances will this be useful?
When the user wants to provide their own allocator instead of using one of the built-ins.
What I mean is specifically the check_tag_fn function. I don't see why it is necessary at all. Is it used for trying to catch programming errors at runtime?
It would allow you to check if a void pointer is a pointer to what you think it is. It was in the previous PR so I left it in. Need to check if it is actually used somewhere.
return success;
}

size_t size_approx () const { return _queue.size (); }
This is not thread-safe.
Hence size_approx () instead of size (). It might be tricky though; I'm not sure what size () returns during a concurrent modification. I will see if I can find it somewhere in the standard and otherwise add a mutex.
Depending on the architecture, you could get a torn read. Although since it is size_t, it should be the platform's native word size and therefore safe-ish. If size_approx needs to be fast, you could alternatively use an std::atomic size counter with relaxed memory ordering.
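For illustration only (this is not code from this PR), a relaxed std::atomic counter maintained next to the queue could look roughly like this:

```cpp
//  Sketch: a relaxed atomic counter kept alongside the queue so that
//  size_approx () never reads the container's internals concurrently.
#include <atomic>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T> class approx_counted_queue
{
  public:
    void push (const T &value)
    {
        std::lock_guard<std::mutex> guard (_mutex);
        _queue.push (value);
        _size.fetch_add (1, std::memory_order_relaxed);
    }

    bool try_pop (T &value)
    {
        std::lock_guard<std::mutex> guard (_mutex);
        if (_queue.empty ())
            return false;
        value = _queue.front ();
        _queue.pop ();
        _size.fetch_sub (1, std::memory_order_relaxed);
        return true;
    }

    //  May be slightly stale, but never a torn read.
    std::size_t size_approx () const
    {
        return _size.load (std::memory_order_relaxed);
    }

  private:
    mutable std::mutex _mutex;
    std::queue<T> _queue;
    std::atomic<std::size_t> _size{0};
};
```

The counter can lag behind the queue by a few elements, which is exactly the "approximate" semantics the name already promises.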
_storage_mutex.lock ();
size_t oldSize = _storage.size ();
if (oldSize <= bl) {
_storage.resize (bl + 1);
While this function is protected by a mutex, these member variables are accessed in other functions without locks, which is not safe.
Thanks! Changed a few things around. Should be better now. Still need to check thoroughly.
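To illustrate the reviewer's point (member names assumed, not the PR's actual code), every function touching the shared storage would need to take the same mutex, e.g. with RAII guards:

```cpp
//  Sketch: consistent locking of _storage; readers and the resize path
//  must share the same mutex, and lock_guard avoids missed unlocks.
#include <cstddef>
#include <mutex>
#include <vector>

class storage_sketch
{
  public:
    void ensure_capacity (std::size_t bl)
    {
        std::lock_guard<std::mutex> guard (_storage_mutex);
        if (_storage.size () <= bl)
            _storage.resize (bl + 1); //  grow under the lock
    }

    std::size_t size () const
    {
        //  The same mutex must be taken here too, otherwise readers race
        //  with the resize above.
        std::lock_guard<std::mutex> guard (_storage_mutex);
        return _storage.size ();
    }

  private:
    mutable std::mutex _storage_mutex;
    std::vector<int> _storage;
};
```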
private:
// Used to check whether the object is a socket.
uint32_t _tag;
I see _tag is marked as dead in the destructor, but the compiler will optimize that out unless _tag is volatile-qualified (which could have performance implications elsewhere).
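A minimal sketch of the pattern being discussed (assumed names, not the PR's code): the tag is poisoned in the destructor so a later check_tag () can detect a stale pointer, but without volatile the final store is a dead write the optimizer may remove:

```cpp
#include <cstdint>

struct allocator_sketch_t
{
    uint32_t _tag = 0xCAFEBABE; //  magic value identifying a live object

    bool check_tag () const { return _tag == 0xCAFEBABE; }

    ~allocator_sketch_t ()
    {
        //  Dead store: nothing reads _tag through this object afterwards,
        //  so the compiler may elide it unless _tag is volatile-qualified.
        _tag = 0xDEADBEEF;
    }
};
```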
header->capacity = nextBlockIndexCapacity;
header->tail.store((prevCapacity - 1) & (nextBlockIndexCapacity - 1), std::memory_order_relaxed);

blockIndex.store(header, std::memory_order_release);
MEMORY_LEAK: memory dynamically allocated at line 2931 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 2963, column 4.
if (!raw)
return nullptr;
char* ptr = details::align_for<TAlign>(reinterpret_cast<char*>(raw) + sizeof(void*));
*(reinterpret_cast<void**>(ptr) - 1) = raw;
MEMORY_LEAK: memory dynamically allocated at line 3556 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3560, column 3.
// If it's < three-quarters full, add to the old one anyway so that we don't have to wait for the next table
// to finish being allocated by another thread (and if we just finished allocating above, the condition will
// always be true)
if (newCount < (mainHash->capacity >> 1) + (mainHash->capacity >> 2)) {
MEMORY_LEAK: memory dynamically allocated at line 3430 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3458, column 8.
I've looked at this as I, too, believe it's worth doing something: in some scenarios (inproc, and presumably other very fast transports) the dominant location where time is spent is the heap. However, I think the approach taken here is too complex. All that is needed is a couple of strategically placed hooks for malloc/free, a couple of typedefs and one public function to pass in two function pointers, then plug in a faster heap. I did just that and used Intel's TBB scalable allocator instead of the CRT heap; the speed gains are in the 20% range below the VSM threshold (33 bytes of message payload or less) and range from 30% to 70% depending on message size above that. In fact it works so well that I later made using TBB a build option in itself, so TBB can be the new default, or you can supply custom functions in case you think you can do better.
hi @axelriet, do you have any "poc quality code" that you can share around your TBB-based approach?
@f18m Yep, you can check my fork of the project. It's a bit of a battlefield atm, but it builds on Windows, all tests pass, and it's a drop-in binary replacement so it should be easy to experiment with. The alloc hooks and defaulting to TBB are two separate things that you can combine or not. If you want to try Intel's allocator, just get the latest oneAPI from their website. They don't supply a static lib (it should be possible to build one, though), so you'll have to add tbbmalloc.dll to your binaries. ETA: I've TBB-ified all the low-level constructs now (tries, queues...). The custom alloc and TBB alloc work is spread amongst several commits, the main ones being this one and that one. I think I can claim a solid 30% increase in throughput using inproc_thr.exe, but I expect the benefits to be even greater on real-world server workloads with lots of competition inside and outside the process. There is one new public function to support the feature:
If never called, the default is either malloc/free or TBB's, depending on build options. As a side note, I've added support for Microsoft Hyper-V's HvSocket transport for guest-host communication (which incidentally is now the fastest zmq IPC on Windows) as well as Vsock for Linux VMs to talk to their host.
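Purely as a hypothetical sketch of the hooks idea (the names and signature below are assumptions, not the actual function in that fork):

```cpp
//  Hypothetical sketch only: one public call that installs global malloc/free
//  hooks; the real name and signature in the fork may differ.
#include <cstddef>

typedef void *(*zmq_custom_malloc_fn) (std::size_t size);
typedef void (*zmq_custom_free_fn) (void *ptr);

//  Install custom allocation hooks; passing NULL restores the defaults
//  (plain malloc/free, or TBB's scalable allocator if built with that option).
void zmq_set_custom_allocator (zmq_custom_malloc_fn malloc_fn,
                               zmq_custom_free_fn free_fn);
```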
I went ahead in my fork and was able to increase the VSM threshold to 40 bytes (up from 33) by repacking the group_t and msg_t structs and reducing short group names to 7 chars. Therefore, group names of 8 chars and up now switch to long group names (vs 15 chars and up previously), but the fast VSM performance applies to messages of up to 40 bytes, which may be important to some users. I think 40/7 is a better compromise than the original 33/14 choice, as users can trivially shrink group names to 7 chars or less if they want maximum performance for small messages, but they could not increase the VSM threshold beyond 33 if their data didn't fit under it. Now they have seven more bytes to play with. I now get the 9M msg/sec inproc_thr performance for messages up to 40 bytes.
See #3644
Basically, the problem we are trying to solve is that the default allocator (malloc/free) is both relatively slow and non-deterministic.
@f18m showed that performance can be increased significantly by re-using pre-allocated memory. My own use case is somewhat different, as I prefer to have as little dynamic memory allocation in my applications as possible: determinism is an important element for me, and allocation on the heap is a source of non-determinism.
I took @f18m's pull request and tried to make it a bit nicer, allowing users to provide their own queues, and made the queue grow dynamically.
Open discussion points:
Todo
RFC
Problem
ZMQ uses quite a bit of dynamic memory allocation. This results in:
1. reduced performance, and
2. non-deterministic behaviour (no bound on allocation latency).
The first one is not a problem in itself, but since performance is one of ZMQ's strengths it is always good to improve it. The second one is an issue in systems that require a bound on latency.
Solution direction
The simplest solution would be to allow users to provide a custom allocator that is used instead of the default one. This has the advantage that users can tailor the application to their needs and solve the above issues: a user can select the trade-off between memory, speed and determinism themselves. Of course it also adds some disadvantages.
To mitigate these somewhat, a standard allocator implementation should be provided.
Default allocator usage
There are quite a few allocation calls in zmq: approximately 100 calls to new and 50 to malloc. Replacing all of these is quite a bit of work and most likely not necessary, since most of them happen during initialisation. The focus should be on the allocations that happen often, all the time; the best place to look is the code that performs an allocation for every message sent or received. This points at allocations in the message content and in the decoder.
At the moment (correct me if I'm wrong) the decoder does not allocate per message but per stream. It seems that the most gain is in the messages, which means the decoder becomes a nice-to-have (at this point).
APIs
Another option is to set an allocator for all messages (using the context). In my opinion this is a nice-to-have, and should only be implemented if someone asks for it.
zmq_allocator_new(type)
Construct an allocator of type. This is a way to provide default allocators to users; if they want to provide their own custom allocator, this function is not used.
zmq_allocator_destroy(allocator)
Destroy an allocator (either user-provided or built-in) after you are done with it. It can only be called when all messages have been handled, i.e. after terminating the context.
zmq_msg_init_allocator(message, size, allocator)
Initialize a message using a custom allocator. Very similar to zmq_msg_init_size, except with an extra argument to pass an allocator.
zmq_allocator_t
A struct consisting of function pointers (a sketch is given after this list).
void*(*allocate_fn) (void *allocator, size_t len)
Allocate function
void(*deallocate_fn) (void *allocator, void *data_)
Deallocate function
bool(*check_tag_fn) (void *allocator)
Check the tag of an allocator to see if the allocator pointer is valid
void *allocator
Pointer to an allocator itself (on which to use the above functions)
optional (not required for the msg interface):
void(*release_fn) (void *allocator, void *data_)
Release data; this means the allocator is no longer the owner of it.
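A minimal usage sketch of the proposed per-message API (the field names follow the description above; the exact struct layout and final names are still open, so treat this as illustrative):

```cpp
//  Sketch only: mirrors the allocator API proposed above.
#include <cstdlib>
#include <zmq.h>

static void *my_allocate (void *allocator, size_t len)
{
    (void) allocator; //  a real allocator would carve the chunk from its pool
    return std::malloc (len);
}

static void my_deallocate (void *allocator, void *data_)
{
    (void) allocator;
    std::free (data_);
}

static bool my_check_tag (void *allocator)
{
    return allocator != NULL; //  a real implementation would verify a magic tag
}

int main ()
{
    zmq_allocator_t alloc;
    alloc.allocate_fn = my_allocate;
    alloc.deallocate_fn = my_deallocate;
    alloc.check_tag_fn = my_check_tag;
    alloc.allocator = &alloc; //  pointer to the allocator instance itself

    zmq_msg_t msg;
    zmq_msg_init_allocator (&msg, 256, &alloc); //  allocate via the custom allocator
    //  ... fill, send or receive the message as usual ...
    zmq_msg_close (&msg);
    return 0;
}
```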
Provided allocators
Default
A basic allocator using new/delete. Basically the same as what we have now, but with the slight added cost of indirection through a vtable (which might be eliminated by using an interface and marking things as final).
Global pool
This is an allocator which keeps queues of memory chunks organised in bins by size. It reduces the amount of dynamic memory allocation (thus gaining speed) at the cost of reserving more memory than strictly needed. To prevent an unreasonable startup footprint, memory is allocated on demand, but in powers of two: if more memory chunks are required than are available in a bin, the number of chunks in that bin is doubled, and if the requested size exceeds the largest bin, a bin of twice that size is created.
Depending on the workload, this way of working can lead to excessive memory use: scenarios with large bursts of a single message size, but changing message sizes between bursts, might leave a lot of memory reserved. If we want to prevent this, it might be wise to set a limit on the pre-allocated memory and fall back to new/delete (or return an error) when it is exceeded. Or we trust the user not to pick this specific allocator.
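A rough sketch of the bin-growth idea described above (simplified, assumed names; not the actual implementation in this PR):

```cpp
//  Simplified sketch of the described global pool: per-size bins of reusable
//  chunks, where a bin doubles its capacity whenever it runs out and larger
//  bins are created on demand. Chunks are recycled, never freed.
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

class global_pool_sketch
{
    struct bin_t
    {
        std::vector<void *> free_chunks;
        std::size_t total = 0; //  chunks ever allocated for this bin size
    };

  public:
    void *allocate (std::size_t size)
    {
        std::lock_guard<std::mutex> guard (_mutex);
        bin_t &bin = bin_for (size);
        if (bin.free_chunks.empty ()) {
            //  Grow by doubling: allocate as many new chunks as already exist.
            std::size_t add = bin.total ? bin.total : 1;
            for (std::size_t i = 0; i < add; ++i)
                bin.free_chunks.push_back (std::malloc (chunk_size (size)));
            bin.total += add;
        }
        void *chunk = bin.free_chunks.back ();
        bin.free_chunks.pop_back ();
        return chunk;
    }

    void deallocate (std::size_t size, void *data)
    {
        std::lock_guard<std::mutex> guard (_mutex);
        bin_for (size).free_chunks.push_back (data); //  recycle the chunk
    }

  private:
    static std::size_t chunk_size (std::size_t size)
    {
        std::size_t s = 256; //  smallest bin; power-of-two sizes above that
        while (s < size)
            s <<= 1;
        return s;
    }

    bin_t &bin_for (std::size_t size)
    {
        std::size_t index = 0;
        for (std::size_t s = 256; s < size; s <<= 1)
            ++index;
        if (index >= _bins.size ())
            _bins.resize (index + 1); //  create a bigger bin on demand
        return _bins[index];
    }

    std::mutex _mutex;
    std::vector<bin_t> _bins;
};
```

This is where the memory-use concern above comes from: once a bin has grown, its chunks stay reserved even if that message size is never used again.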
Other?