-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support free-fragment recycling in shared-segment. Add fingerprint object management. #569
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for splinterdb canceled.
|
@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs); | |||
// Close a splinterdb | |||
// | |||
// This will flush all data to disk and release all resources | |||
void | |||
int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To percolate errors found by shm-destroy, if large-fragments not free are still found hanging around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good change. Can you add a comment defining the meaning of the return value? e.g.
"returns 0 on success, non-zero otherwise."
Or
"returns
- 0 on success,
- a positive integer when all data has been persisted but not all resources were able to be released, and
- a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."
@@ -380,7 +380,6 @@ void PACKEDARRAY_JOIN(__PackedArray_unpack_, PACKEDARRAY_IMPL_BITS_PER_ITEM)(con | |||
#include "poison.h" | |||
|
|||
#define PACKEDARRAY_MALLOC(size) platform_malloc(size) | |||
#define PACKEDARRAY_FREE(p) platform_free(p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused interface.
platform_assert(req->num_tuples < req->max_tuples); | ||
req->fingerprint_arr[req->num_tuples] = | ||
platform_assert(btree_pack_can_fit_tuple(req)); | ||
fingerprint_start(&req->fingerprint)[req->num_tuples] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where you will start to see the use of fingerprint object and its accessor / interfaces.
uint32 *fingerprint_arr; // IN/OUT: hashes of the keys in the tree | ||
hash_fn hash; // hash function used for calculating filter_hash | ||
unsigned int seed; // seed used for calculating filter_hash | ||
fp_hdr fingerprint; // IN/OUT: hashes of the keys in the tree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The in-place char * array
is now replaced by the fingerprint object, which carries inside of it platform_memfrag{}
handle to track allocate memory fragment's size and to free it reliably.
req->fingerprint_arr = | ||
TYPED_ARRAY_ZALLOC(hid, req->fingerprint_arr, max_tuples); | ||
|
||
fingerprint_init(&req->fingerprint, hid, max_tuples); // Allocates memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inline memory allocation on old L345 is, henceforth, replaced by init()'ing the fingerprint object ... And so on ...
"Unable to allocate memory for %lu tuples", | ||
max_tuples); | ||
if (!req->fingerprint_arr) { | ||
if (fingerprint_is_empty(&req->fingerprint)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can no longer check for NULL array ptr, to detect OOM. You must consult the is_empty()
method to figure out if there is memory or not.
if (req->fingerprint_arr) { | ||
platform_free(hid, req->fingerprint_arr); | ||
if (!fingerprint_is_empty(&req->fingerprint)) { | ||
fingerprint_deinit(hid, &req->fingerprint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deinit()
will free memory.
if (!cc->lookup) { | ||
goto alloc_error; | ||
} | ||
cc->lookup_size = memfrag_size(&memfrag_cc_lookup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the first instance of a pair of init
/ deinit
calls, which now need to communicate the size of memory fragment allocated by init()
.
Like it's done on this line, few common structures now gain a new size
field to track the memory fragment's size. These structures are of the kind where they are allocated / init'ed in one function and much later the deinit()
method is called in a separate function.
src/clockcache.c
Outdated
if (cc->lookup) { | ||
platform_free(cc->heap_id, cc->lookup); | ||
memfrag_init_size(mf, cc->lookup, cc->lookup_size); | ||
platform_free(cc->heap_id, mf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
free()
needs to be told the fragment's size correctly. This is obtained from the size
field stashed away when init()
was done.
src/clockcache.c
Outdated
} | ||
if (cc->entry) { | ||
platform_free(cc->heap_id, cc->entry); | ||
memfrag_init_size(mf, cc->entry, cc->entry_size); | ||
platform_free(cc->heap_id, mf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same pattern of changes continues. This will appear in many more instances ...
src/memtable.c
Outdated
platform_memfrag memfrag_ctxt; | ||
platform_memfrag *mf = &memfrag_ctxt; | ||
memfrag_init_size(mf, ctxt, ctxt->mt_ctxt_size); | ||
platform_free(hid, mf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This does look like verbose multi-line repeat code.
@rtjohnso - I did consider whether to add a packaged macro, say, memfrag_init()
, to which you supply the addr / init. Inside the body, we can declare a hidden structures and platform_memfrag *mf
, do the setup, and pass-it as the 2nd arg.
Can be done ... probably .. did not try it too hard. Wanted to get this into review, and then I expect anyway to get comments on this approach.
We can re-discuss the coding impact this approach has ... and re-visit during review.
src/platform_linux/laio.c
Outdated
@@ -287,7 +291,11 @@ io_handle_deinit(laio_handle *io) | |||
} | |||
platform_assert(status == 0); | |||
|
|||
platform_free(io->heap_id, io->req); | |||
platform_memfrag memfrag = {.addr = io->req, .size = io->req_size}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE to myself: This should go away. Currently, memfrag_init_size()
is a #define and this struct is exposed in platform.h
. Rework this so the fields are hidden, and only memfrag_init_size()
is exposed to client code.
This will prevent such naked assignments.
@@ -84,13 +84,16 @@ platform_heap_create(platform_module_id UNUSED_PARAM(module_id), | |||
return STATUS_OK; | |||
} | |||
|
|||
void | |||
platform_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percolating errors upstream from platform_shmdestroy
; see L91 below.
platform_histo_handle hh; | ||
hh = TYPED_MANUAL_MALLOC( | ||
hh = TYPED_ARRAY_MALLOC( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equivalent calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC
.
src/platform_linux/platform.h
Outdated
({ \ | ||
debug_assert((n) >= sizeof(*(v))); \ | ||
(typeof(v))platform_aligned_malloc(hid, \ | ||
PLATFORM_CACHELINE_SIZE, \ | ||
(n), \ | ||
(mf), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocation now receives and returns a platform_memfrag{} *
, So, macro's call changes.
src/platform_linux/platform.h
Outdated
@@ -368,13 +371,13 @@ extern platform_heap_id Heap_id; | |||
({ \ | |||
debug_assert((n) >= sizeof(*(v))); \ | |||
(typeof(v))platform_aligned_malloc( \ | |||
hid, (a), (n), STRINGIFY(v), __func__, __FILE__, __LINE__); \ | |||
hid, (a), (n), NULL, STRINGIFY(v), __func__, __FILE__, __LINE__); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most of the consumers, this is good-enough. I could have changed this to require all callers to also declare an on-stack platform_memfrag{}
, but that would be more code changes.
The one 'minor' issue with this is that we might incorrectly free a smaller-sized fragment . But that's not a huge loss, so I went with current solution.
src/platform_linux/platform.h
Outdated
"Attempt to free a NULL ptr from '%s', line=%d", \ | ||
__func__, \ | ||
__LINE__); \ | ||
if (IS_MEM_FRAG(p)) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is key ... and needs understanding. Please review carefully.
src/platform_linux/platform.h
Outdated
const size_t _reqd = \ | ||
(_size + platform_alignment(PLATFORM_CACHELINE_SIZE, _size)); \ | ||
platform_free_mem((hid), (p), _reqd, STRINGIFY(p)); \ | ||
(p) = NULL; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rtjohnso - This line and L907 below is what makes it necessary for clients calling free()
to do two things:
platform_memfrag memfrag;
platform_memfrag *mf;
... Do the initialization ...
platform_free(mf);
I would have liked to skip the mf
and simply pass-in &memfrag
to free()
, but there is a compiler error.
I think this can be fixed with some rework ... but I ran out of energy. Let's review if this can be improved.
src/platform_linux/platform_inline.h
Outdated
platform_do_realloc(const platform_heap_id heap_id, | ||
const size_t oldsize, | ||
void *ptr, // IN | ||
size_t *newsize, // IN/OUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reallocation now returns the *newsize
, so clients like writable buffer resize can record the new fragment's size in its buffer_capacity
field.
This, then, allows writable_buffer_deinit()
to correctly supply the newly realloc'ed fragment's size to free
.
void *retptr = (heap_id ? splinter_shm_alloc(heap_id, required, objname) | ||
: aligned_alloc(alignment, required)); | ||
void *retptr = NULL; | ||
if (heap_id == PROCESS_PRIVATE_HEAP_ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified usage of the semantic of heap_id
; NULL means process-private-heap-ID, so go thru old malloc()
-style code-flow.
src/platform_linux/shmem.c
Outdated
int frag_allocated_to_pid; // Allocated to this OS-pid | ||
int frag_freed_by_pid; // OS-pid that freed this fragment | ||
threadid frag_freed_by_tid; // Splinter thread-ID that freed this | ||
int frag_line; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changed, indented and aligned fields for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated: Nothing changed ... only indentation changes ...
# define SHM_LARGE_FRAG_SIZE (90 * KiB) | ||
#else | ||
# define SHM_LARGE_FRAG_SIZE (38 * KiB) | ||
#endif // SPLINTER_DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diff in this limit was causing unit-tests to fail in debug builds and pass in release build.
This artifact is leftover from my poc-dev days ... no longer should be needed to separate out limits.
Have stabilized on 32K as lower limit for large fragments in both builds. Moved to shmem.h
typedef struct free_frag_hdr { | ||
struct free_frag_hdr *free_frag_next; | ||
size_t free_frag_size; | ||
} free_frag_hdr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used to chain free'd-fragments that are returned to the free-list. This tiny struct lives at the head of each free fragment.
Min frag-size is 64 bytes, so we have room.
* can print these after shared segment has been destroyed. | ||
* ------------------------------------------------------------------------ | ||
*/ | ||
typedef struct shminfo_usage_stats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consolidated all metrics, usage-stats into this common struct.
It will be update in-place nested in the shmem control block.
And will also be used to return metrics when shared memory is being dismantled.
Works much better this way!!
src/platform_linux/shmem.c
Outdated
@@ -198,13 +309,13 @@ platform_shm_hip(platform_heap_id hid) | |||
static inline void | |||
shm_lock_mem_frags(shmem_info *shminfo) | |||
{ | |||
platform_spin_lock(&shminfo->shm_mem_frags_lock); | |||
platform_mutex_lock(&shminfo->shm_mem_frags_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better solution.... Otherwise, for some cases of new large_inserts_stress_test
, we were simply burning up 100% CPU.
In all workloads, pthread-semaphore is less than 5%, sometimes even 1-2% as seen in perf top
.
src/platform_linux/shmem.c
Outdated
platform_save_usage_stats(shminfo_usage_stats *usage, shmem_info *shminfo) | ||
{ | ||
*usage = shminfo->usage; | ||
usage->large_frags_found_in_use = platform_trace_large_frags(shminfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much simpler rather than the line-by-line copy that old code from L303 onwards was doing.
src/platform_linux/shmem.c
Outdated
} else { | ||
// Try to satisfy small memory fragments based on requested size, from | ||
// cached list of free-fragments. | ||
retptr = platform_shm_find_frag(shminfo, size, objname, func, file, line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New find
method to find small free fragments, thru its free-list-by-size.
{ | ||
((free_frag_hdr *)ptr)->free_frag_next = *here; | ||
((free_frag_hdr *)ptr)->free_frag_size = size; | ||
*here = ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inserting free'd- fragment at the head of its free-list
Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c, with appropriate changes to the build Makefile and test files, to pickup new file.
Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c . Make appropriate changes to the build Makefile and test files, to pickup new file.
Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c . Make appropriate changes to the build Makefile and test files, to pickup new file.
Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for upcoming small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() ... No other code-/logic-changes are done with this commit.
Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit.
Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit.
Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit.
This commit refactors shared memory usage stats fields to drive-off shminfo_usage_stats{} struct entirely. Add platform_save_usage_stats(), used by platform_shm_print_usage(). This refactoring paves the way for upcoming PR #569 which is adding more memory-usage stats fields.
This commit refactors shared memory usage stats fields to drive-off shminfo_usage_stats{} struct entirely. Add platform_save_usage_stats(), used by platform_shm_print_usage(). This refactoring paves the way for upcoming PR #569 which is adding more memory-usage stats fields.
6c24747
to
33a95f2
Compare
@rtjohnso - The final part-3 shared memory support change-set is now ready for review. The suggested order in which to review these diffs is:
|
I think the current memfrag interface is leaky and not general. I think the interface should look like this:
(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.) The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road. As for names, I would advocate renaming memfrag to |
HI, @rtjohnso -- Thanks for your initial approach on reworking the interfaces. I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction. I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times. How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces. (As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.) |
…ject mgmt The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits. Another driving factor for implementing some free-list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point, and we'd frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle). The main design changes to manage small-fragments are follows: Managing memory allocation / free using platform_memfrag{} fragments: - Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to the cacheline. - Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request. - Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket. - Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient for current benchmarking requirements.) A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself. New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size. - Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request. For example, a request from a client for 150 bytes will be rounded-up a cacheline boundary, i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes). - An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory. - All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free(). - In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method. An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'. This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ... - Cautionary Note: If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic <struct> *, and its sizeof() will be used as the 'size' of the fragment to free. This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.) - Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls. Fingerprint Object Management: Managing memory for fingerprint arrays was particularly problematic. This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process. The requirements from the new interfaces are handled as follows: - Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations. - Various accessor methods are added to manage memory for fingerprint arrays through this object. E.g., - fingerprint_init() - Will allocate required fingerprint for 'ntuples'. - fingerprint_deinit() - Will dismantle object and free the memory - fingerprint_start() - Returns start of fingerprint array's memory - fingerprint_nth() - Returns n'th element of fingerprint Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics. - When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments. (Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.) Test changes: Miscellaneous: - Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.) - Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces: - platform_apis_test: - splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly. - Enhanced various diagnostics, asserts, tracing - Improved memory usage stats gathering and reporting - Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list - Add diagram for large-free fragment tracking.
This commit reworks the interfaces on the lines discussed in this discussion thread: #615 void * platform_alloc(memfrag *mf, int size, ...); void * platform_realloc(memfrag *mf, int newsize); void platform_free(memfrag *mf); // IN Currently, the return from `platform_free` still remains as void. Changing it to platform_status will mean plumbing the return handling to all callers. Also C-system call `free()` is defined as `void`. So changing `platform_free` to return platform_status will be a bit inconsistent.
3588f3e
to
9281c83
Compare
@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set. You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616 ) |
@rtjohnso : Fyi -- I want to log this one ASAN-instability the most recent round of CI-jobs ran into, as I am not going to remember all this later. Here is the state of affairs and results of my investigations.
The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)
The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts. NOTE: In the original failure in CI, hard to tell exactly, but it seems like the thread ID 2 ran into the ASAN memory over flow and soon after, thread ID=3 ran into this assertion a few lines later:
You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion. I suspect that there is something lurking there that popped up in the CI-run. I cannot explain how / whether / if this assertion tripping is caused by the ASAN heap-buffer-overflow error or if they are even related. Unfortunately, I could not repro the ASAN issue outside CI, so have to give up on this investigation now. The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've mostly just gone through the headers in the platform
code, plus the fingerprint array api.
I assume once we get these nailed down, then most of the changes in the rest of the code will be relatively straightforward updates to the new apis.
Or is there anything else major?
Let's get the new apis sorted and then I can review the whole PR.
@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs); | |||
// Close a splinterdb | |||
// | |||
// This will flush all data to disk and release all resources | |||
void | |||
int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good change. Can you add a comment defining the meaning of the return value? e.g.
"returns 0 on success, non-zero otherwise."
Or
"returns
- 0 on success,
- a positive integer when all data has been persisted but not all resources were able to be released, and
- a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."
@@ -3167,8 +3173,8 @@ btree_pack_loop(btree_pack_req *req, // IN/OUT | |||
log_trace_key(tuple_key, "btree_pack_loop (bottom)"); | |||
|
|||
if (req->hash) { | |||
platform_assert(req->num_tuples < req->max_tuples); | |||
req->fingerprint_arr[req->num_tuples] = | |||
platform_assert(btree_pack_can_fit_tuple(req)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
platform_histo_handle hh; | ||
hh = TYPED_MANUAL_MALLOC( | ||
hh = TYPED_ARRAY_MALLOC( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC
.
#define TYPED_ALIGNED_ZALLOC(hid, a, v, n) \ | ||
|
||
#define TYPED_ALIGNED_MALLOC(hid, a, v, n) \ | ||
TYPED_ALIGNED_MALLOC_MF(&memfrag_##v, hid, a, v, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate this hard-coding of this naming convention, and I think it doesn't solve an important problem.
Let me propose something that might be more useful.
#define TYPED_ALIGNED_MALLOC_AUTO(hid, a, v, n) \
platform_memfrag v_##memfrag __attribute__(cleanup(platform_free));
TYPED_ALIGNED_MALLOC_MF(&v_##memfrag, hid, a, v, n)
This allocates memory that will automatically get freed when the function exits the scope of the allocation.
Furthermore, it alleviates the user of the responsibility to declare a memfrag at all.
* Utility macro to test if an argument to platform_free() is a | ||
* platform_memfrag *. | ||
*/ | ||
#define IS_MEM_FRAG(x) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will go away, right?
__FILE__, \ | ||
__LINE__); \ | ||
_mf->addr = NULL; \ | ||
_mf->size = 0; \ | ||
} while (0) | ||
|
||
// Convenience function to free something volatile | ||
static inline void | ||
platform_free_volatile_from_heap(platform_heap_id heap_id, | ||
volatile void *ptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change to accept a memfrag, right?
@@ -40,10 +40,11 @@ platform_checksum_is_equal(checksum128 left, checksum128 right) | |||
static void | |||
platform_free_from_heap(platform_heap_id UNUSED_PARAM(heap_id), | |||
void *ptr, | |||
const size_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be modified to take a memfrag, right?
platform_free_volatile_from_heap( \ | ||
id, (p), STRINGIFY(p), __func__, __FILE__, __LINE__); \ | ||
(p) = NULL; \ | ||
debug_assert(((p) != NULL), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once you change platform_free_volatile_from_heap
to take a memfrag, you won't need all this type checking.
void *retptr = NULL; | ||
if (heap_id == PROCESS_PRIVATE_HEAP_ID) { | ||
retptr = aligned_alloc(alignment, required); | ||
if (memfrag) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get rid of the if
. Always require a memfrag parameter.
static inline void | ||
platform_free_from_heap(platform_heap_id heap_id, | ||
void *ptr, | ||
const size_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a memfrag instead.
@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.
I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself to get lots of comments as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, then that will reduce a bunch of rework rounds on me.
Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.) Once I go over all the changes, I will be better able to answer this question of yours:
... for which the answer now is, I don't think so, off-hand. |
I left a few comments on the fingerprint array code already. I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together. |
I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code. So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer. |
The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits.
Another driving factor for implementing free-fragment list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point. We would frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).
Design Overview
The main design changes to manage small-fragments are follows:
Managing memory allocation / free using
platform_memfrag{}
fragmentsAllocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the
memory->{addr, size}
. All memory requests (as is being done previously) are aligned tothe cacheline.
Allocation: All clients of memory allocation have to "hand-in" an opaque
platform_memfrag{}
handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request.Free: Clients now have to safely keep a handle to this returned
platform_memfrag{}
, and hand it back to thefree()
method.free()
will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket, if the fragment's size is one in a small set of free-fragments being tracked.Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient
for current benchmarking requirements.)
A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself.
New
struct free_frag_hdr{}
provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to thefree()
call is is recorded as the free'd fragment's size.Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request.
For example, a request from a client for 150 bytes will be rounded-up to a cacheline boundary,
i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes).
An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory.
All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack
platform_memfrag{}
handle, which is passed back toplatform_free()
.In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method.
clockcache_init()
where this kind of work to record the 'size' of the fragment is done and passed-down toclockcache_deinit()
, where the memory fragment is then freed with the right 'size'.This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ...
Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls.
Cautionary Note
If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic *, and its sizeof() will be used as the 'size' of the fragment to free.
This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if
we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.)
Fingerprint Object Management
Managing memory for fingerprint arrays was particularly problematic.
This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.
The requirements from the new interfaces are handled as follows:
Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations.
Various accessor methods are added to manage memory for fingerprint arrays through this object.
E.g.,
Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics.
platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments.
(Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.)
Test changes
Miscellaneous
Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.)
Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces:
Enhanced various diagnostics, asserts, tracing
Improved memory usage stats gathering and reporting
Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list