Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

gapisback · 2023-04-14T23:29:15Z

The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits.

Another driving factor for implementing free-fragment list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point. We would frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).

Design Overview

The main design changes to manage small-fragments are follows:

Managing memory allocation / free using `platform_memfrag{}` fragments

Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to
the cacheline.
- Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request.
- Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket, if the fragment's size is one in a small set of free-fragments being tracked.
Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient
for current benchmarking requirements.)

A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself.
New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size.
Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request.

For example, a request from a client for 150 bytes will be rounded-up to a cacheline boundary,
i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes).

An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory.
All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free().
In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method.
- An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'.
This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ...
Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls.

Cautionary Note

If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic *, and its sizeof() will be used as the 'size' of the fragment to free.

This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if
we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.)

Fingerprint Object Management

Managing memory for fingerprint arrays was particularly problematic.

This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.

The requirements from the new interfaces are handled as follows:

Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations.
Various accessor methods are added to manage memory for fingerprint arrays through this object.
E.g.,
- fingerprint_init() - Will allocate required fingerprint for 'ntuples'.
- fingerprint_deinit() - Will dismantle object and free the memory
- fingerprint_start() - Returns start of fingerprint array's memory
- fingerprint_nth() - Returns n'th element of fingerprint

Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics.

When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to
platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments.
(Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.)

Test changes

Miscellaneous

Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.)
Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces:
- platform_apis_test:
- splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly.
Enhanced various diagnostics, asserts, tracing
Improved memory usage stats gathering and reporting
Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list

netlify · 2023-04-14T23:29:21Z

✅ Deploy Preview for splinterdb canceled.

Name	Link
🔨 Latest commit	`9281c83`
🔍 Latest deploy log	https://app.netlify.com/sites/splinterdb/deploys/65bb050c366a1500096d0cad

gapisback · 2023-04-14T23:35:06Z

include/splinterdb/splinterdb.h

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
 // Close a splinterdb
 //
 // This will flush all data to disk and release all resources
-void
+int


To percolate errors found by shm-destroy, if large-fragments not free are still found hanging around.

Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

0 on success,

a positive integer when all data has been persisted but not all resources were able to be released, and

a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

gapisback · 2023-04-14T23:35:17Z

src/PackedArray.c

@@ -380,7 +380,6 @@ void PACKEDARRAY_JOIN(__PackedArray_unpack_, PACKEDARRAY_IMPL_BITS_PER_ITEM)(con
 #include "poison.h"

 #define PACKEDARRAY_MALLOC(size) platform_malloc(size)
-#define PACKEDARRAY_FREE(p) platform_free(p)


Unused interface.

gapisback · 2023-04-14T23:36:07Z

src/btree.c

-      platform_assert(req->num_tuples < req->max_tuples);
-      req->fingerprint_arr[req->num_tuples] =
+      platform_assert(btree_pack_can_fit_tuple(req));
+      fingerprint_start(&req->fingerprint)[req->num_tuples] =


Here's where you will start to see the use of fingerprint object and its accessor / interfaces.

gapisback · 2023-04-14T23:37:05Z

src/btree.h

-   uint32       *fingerprint_arr; // IN/OUT: hashes of the keys in the tree
+   hash_fn       hash;        // hash function used for calculating filter_hash
+   unsigned int  seed;        // seed used for calculating filter_hash
+   fp_hdr        fingerprint; // IN/OUT: hashes of the keys in the tree


The in-place char * array is now replaced by the fingerprint object, which carries inside of it platform_memfrag{} handle to track allocate memory fragment's size and to free it reliably.

gapisback · 2023-04-14T23:38:20Z

src/btree.h

-      req->fingerprint_arr =
-         TYPED_ARRAY_ZALLOC(hid, req->fingerprint_arr, max_tuples);
+
+      fingerprint_init(&req->fingerprint, hid, max_tuples); // Allocates memory


Inline memory allocation on old L345 is, henceforth, replaced by init()'ing the fingerprint object ... And so on ...

gapisback · 2023-04-14T23:39:03Z

src/btree.h

                   "Unable to allocate memory for %lu tuples",
                   max_tuples);
-      if (!req->fingerprint_arr) {
+      if (fingerprint_is_empty(&req->fingerprint)) {


You can no longer check for NULL array ptr, to detect OOM. You must consult the is_empty() method to figure out if there is memory or not.

gapisback · 2023-04-14T23:39:26Z

src/btree.h

-   if (req->fingerprint_arr) {
-      platform_free(hid, req->fingerprint_arr);
+   if (!fingerprint_is_empty(&req->fingerprint)) {
+      fingerprint_deinit(hid, &req->fingerprint);


deinit() will free memory.

gapisback · 2023-04-14T23:40:35Z

src/clockcache.c

   if (!cc->lookup) {
      goto alloc_error;
   }
+   cc->lookup_size = memfrag_size(&memfrag_cc_lookup);


Here's the first instance of a pair of init / deinit calls, which now need to communicate the size of memory fragment allocated by init().

Like it's done on this line, few common structures now gain a new size field to track the memory fragment's size. These structures are of the kind where they are allocated / init'ed in one function and much later the deinit() method is called in a separate function.

gapisback · 2023-04-14T23:41:28Z

src/clockcache.c

   if (cc->lookup) {
-      platform_free(cc->heap_id, cc->lookup);
+      memfrag_init_size(mf, cc->lookup, cc->lookup_size);
+      platform_free(cc->heap_id, mf);


free() needs to be told the fragment's size correctly. This is obtained from the size field stashed away when init() was done.

gapisback · 2023-04-14T23:42:00Z

src/clockcache.c

   }
   if (cc->entry) {
-      platform_free(cc->heap_id, cc->entry);
+      memfrag_init_size(mf, cc->entry, cc->entry_size);
+      platform_free(cc->heap_id, mf);


Same pattern of changes continues. This will appear in many more instances ...

gapisback · 2023-04-14T23:45:08Z

src/memtable.c

+   platform_memfrag  memfrag_ctxt;
+   platform_memfrag *mf = &memfrag_ctxt;
+   memfrag_init_size(mf, ctxt, ctxt->mt_ctxt_size);
+   platform_free(hid, mf);


NOTE: This does look like verbose multi-line repeat code.

@rtjohnso - I did consider whether to add a packaged macro, say, memfrag_init(), to which you supply the addr / init. Inside the body, we can declare a hidden structures and platform_memfrag *mf, do the setup, and pass-it as the 2nd arg.

Can be done ... probably .. did not try it too hard. Wanted to get this into review, and then I expect anyway to get comments on this approach.

We can re-discuss the coding impact this approach has ... and re-visit during review.

gapisback · 2023-04-14T23:46:43Z

src/platform_linux/laio.c

@@ -287,7 +291,11 @@ io_handle_deinit(laio_handle *io)
   }
   platform_assert(status == 0);

-   platform_free(io->heap_id, io->req);
+   platform_memfrag  memfrag = {.addr = io->req, .size = io->req_size};


NOTE to myself: This should go away. Currently, memfrag_init_size() is a #define and this struct is exposed in platform.h. Rework this so the fields are hidden, and only memfrag_init_size() is exposed to client code.

This will prevent such naked assignments.

gapisback · 2023-04-14T23:47:18Z

src/platform_linux/platform.c

@@ -84,13 +84,16 @@ platform_heap_create(platform_module_id    UNUSED_PARAM(module_id),
   return STATUS_OK;
 }

-void
+platform_status


Percolating errors upstream from platform_shmdestroy; see L91 below.

gapisback · 2023-04-14T23:47:55Z

src/platform_linux/platform.c

   platform_histo_handle hh;
-   hh = TYPED_MANUAL_MALLOC(
+   hh = TYPED_ARRAY_MALLOC(


Equivalent calls.

Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

gapisback · 2023-04-14T23:49:07Z

src/platform_linux/platform.h

   ({                                                                          \
      debug_assert((n) >= sizeof(*(v)));                                       \
      (typeof(v))platform_aligned_malloc(hid,                                  \
                                         PLATFORM_CACHELINE_SIZE,              \
                                         (n),                                  \
+                                         (mf),                                 \


Allocation now receives and returns a platform_memfrag{} *, So, macro's call changes.

gapisback · 2023-04-14T23:50:30Z

src/platform_linux/platform.h

@@ -368,13 +371,13 @@ extern platform_heap_id     Heap_id;
   ({                                                                          \
      debug_assert((n) >= sizeof(*(v)));                                       \
      (typeof(v))platform_aligned_malloc(                                      \
-         hid, (a), (n), STRINGIFY(v), __func__, __FILE__, __LINE__);           \
+         hid, (a), (n), NULL, STRINGIFY(v), __func__, __FILE__, __LINE__);     \


For most of the consumers, this is good-enough. I could have changed this to require all callers to also declare an on-stack platform_memfrag{}, but that would be more code changes.

The one 'minor' issue with this is that we might incorrectly free a smaller-sized fragment . But that's not a huge loss, so I went with current solution.

gapisback · 2023-04-14T23:52:01Z

src/platform_linux/platform.h

+                      "Attempt to free a NULL ptr from '%s', line=%d",         \
+                      __func__,                                                \
+                      __LINE__);                                               \
+      if (IS_MEM_FRAG(p)) {                                                    \


This change is key ... and needs understanding. Please review carefully.

gapisback · 2023-04-14T23:55:06Z

src/platform_linux/platform.h

+         const size_t _reqd =                                                  \
+            (_size + platform_alignment(PLATFORM_CACHELINE_SIZE, _size));      \
+         platform_free_mem((hid), (p), _reqd, STRINGIFY(p));                   \
+         (p) = NULL;                                                           \


@rtjohnso - This line and L907 below is what makes it necessary for clients calling free() to do two things:

platform_memfrag memfrag; platform_memfrag *mf; ... Do the initialization ... platform_free(mf);

I would have liked to skip the mf and simply pass-in &memfrag to free(), but there is a compiler error.

I think this can be fixed with some rework ... but I ran out of energy. Let's review if this can be improved.

gapisback · 2023-04-14T23:57:03Z

src/platform_linux/platform_inline.h

+platform_do_realloc(const platform_heap_id heap_id,
+                    const size_t           oldsize,
+                    void                  *ptr,     // IN
+                    size_t                *newsize, // IN/OUT


Reallocation now returns the *newsize, so clients like writable buffer resize can record the new fragment's size in its buffer_capacity field.

This, then, allows writable_buffer_deinit() to correctly supply the newly realloc'ed fragment's size to free.

gapisback · 2023-04-14T23:58:23Z

src/platform_linux/platform_inline.h

-   void *retptr = (heap_id ? splinter_shm_alloc(heap_id, required, objname)
-                           : aligned_alloc(alignment, required));
+   void *retptr = NULL;
+   if (heap_id == PROCESS_PRIVATE_HEAP_ID) {


Clarified usage of the semantic of heap_id; NULL means process-private-heap-ID, so go thru old malloc()-style code-flow.

gapisback · 2023-04-14T23:59:09Z

src/platform_linux/shmem.c

+   int         frag_allocated_to_pid; // Allocated to this OS-pid
+   int         frag_freed_by_pid;     // OS-pid that freed this fragment
+   threadid    frag_freed_by_tid;     // Splinter thread-ID that freed this
+   int         frag_line;


No changed, indented and aligned fields for readability.

Updated: Nothing changed ... only indentation changes ...

gapisback · 2023-04-15T00:00:47Z

src/platform_linux/shmem.c

-#   define SHM_LARGE_FRAG_SIZE (90 * KiB)
-#else
-#   define SHM_LARGE_FRAG_SIZE (38 * KiB)
-#endif // SPLINTER_DEBUG


Diff in this limit was causing unit-tests to fail in debug builds and pass in release build.

This artifact is leftover from my poc-dev days ... no longer should be needed to separate out limits.

Have stabilized on 32K as lower limit for large fragments in both builds. Moved to shmem.h

gapisback · 2023-04-15T00:01:39Z

src/platform_linux/shmem.c

+typedef struct free_frag_hdr {
+   struct free_frag_hdr *free_frag_next;
+   size_t                free_frag_size;
+} free_frag_hdr;


Used to chain free'd-fragments that are returned to the free-list. This tiny struct lives at the head of each free fragment.

Min frag-size is 64 bytes, so we have room.

gapisback · 2023-04-15T00:02:50Z

src/platform_linux/shmem.c

+ * can print these after shared segment has been destroyed.
+ * ------------------------------------------------------------------------
+ */
+typedef struct shminfo_usage_stats {


Consolidated all metrics, usage-stats into this common struct.

It will be update in-place nested in the shmem control block.

And will also be used to return metrics when shared memory is being dismantled.

Works much better this way!!

gapisback · 2023-04-15T00:04:05Z

src/platform_linux/shmem.c

@@ -198,13 +309,13 @@ platform_shm_hip(platform_heap_id hid)
 static inline void
 shm_lock_mem_frags(shmem_info *shminfo)
 {
-   platform_spin_lock(&shminfo->shm_mem_frags_lock);
+   platform_mutex_lock(&shminfo->shm_mem_frags_mutex);


Better solution.... Otherwise, for some cases of new large_inserts_stress_test, we were simply burning up 100% CPU.

In all workloads, pthread-semaphore is less than 5%, sometimes even 1-2% as seen in perf top.

gapisback · 2023-04-15T00:05:03Z

src/platform_linux/shmem.c

+platform_save_usage_stats(shminfo_usage_stats *usage, shmem_info *shminfo)
+{
+   *usage                          = shminfo->usage;
+   usage->large_frags_found_in_use = platform_trace_large_frags(shminfo);


Much simpler rather than the line-by-line copy that old code from L303 onwards was doing.

gapisback · 2023-04-15T00:06:15Z

src/platform_linux/shmem.c

+   } else {
+      // Try to satisfy small memory fragments based on requested size, from
+      // cached list of free-fragments.
+      retptr = platform_shm_find_frag(shminfo, size, objname, func, file, line);


New find method to find small free fragments, thru its free-list-by-size.

gapisback · 2023-04-15T00:07:31Z

src/platform_linux/shmem.c

+{
+   ((free_frag_hdr *)ptr)->free_frag_next = *here;
+   ((free_frag_hdr *)ptr)->free_frag_size = size;
+   *here                                  = ptr;


Inserting free'd- fragment at the head of its free-list

Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c, with appropriate changes to the build Makefile and test files, to pickup new file.

Upcoming PR #569 is overhauling large-inserts stress test. To simplify examining the diffs of this test case as part of that review, this commit is renaming the test file to large_inserts_stress_test.c . Make appropriate changes to the build Makefile and test files, to pickup new file.

Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for upcoming small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() ... No other code-/logic-changes are done with this commit.

Upcoming PR #569 is bringing-in support for handling small fragments. This commit renames existing variables, field names and a few function names that deal with large-fragment support to consistently use 'large' in the name. This clears the way in the namespace for code changes coming from small-fragment changes. Some examples: - struct shm_frag_info -> struct shm_large_frag_info - E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ... - shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[] - shm_num_frags_tracked -> shm_nlarge_frags_tracked - platform_shm_find_free() -> platform_shm_find_large() NOTE: No other code-/logic-changes are done with this commit.

This commit refactors shared memory usage stats fields to drive-off shminfo_usage_stats{} struct entirely. Add platform_save_usage_stats(), used by platform_shm_print_usage(). This refactoring paves the way for upcoming PR #569 which is adding more memory-usage stats fields.

gapisback · 2024-01-25T06:58:43Z

@rtjohnso - The final part-3 shared memory support change-set is now ready for review.

The suggested order in which to review these diffs is:

src/platform_linux/platform.h
src/platform_linux/platform_inline.h
src/platform_linux/platform_types.h
src/util.h, src/util.c
src/platform_linux/shmem.h
src/platform_linux/shmem.c
src/trunk.h, src/trunk.c
src/routing_filter.c
Then the rest of the files.
Good luck!

rtjohnso · 2024-01-27T07:29:36Z

I think the current memfrag interface is leaky and not general.

I think the interface should look like this:

platform_status
platform_alloc(memfrag *mf, // OUT
               int size);

platform_status
platform_realloc(memfrag *mf, // IN/OUT
                 int newsize);

platform_status
platform_free(memfrag *mf); // IN

void *
memfrag_get_pointer(memfrag *mf);

(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.)

The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road.

As for names, I would advocate renaming memfrag to memory_allocation.

gapisback · 2024-01-27T16:16:14Z

HI, @rtjohnso --

Thanks for your initial approach on reworking the interfaces.

I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction.

I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times.

How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces.

(As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.)

…ject mgmt The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits. Another driving factor for implementing some free-list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point, and we'd frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle). The main design changes to manage small-fragments are follows: Managing memory allocation / free using platform_memfrag{} fragments: - Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to the cacheline. - Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request. - Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket. - Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient for current benchmarking requirements.) A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself. New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size. - Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request. For example, a request from a client for 150 bytes will be rounded-up a cacheline boundary, i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes). - An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory. - All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free(). - In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method. An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'. This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ... - Cautionary Note: If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic <struct> *, and its sizeof() will be used as the 'size' of the fragment to free. This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.) - Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls. Fingerprint Object Management: Managing memory for fingerprint arrays was particularly problematic. This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process. The requirements from the new interfaces are handled as follows: - Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations. - Various accessor methods are added to manage memory for fingerprint arrays through this object. E.g., - fingerprint_init() - Will allocate required fingerprint for 'ntuples'. - fingerprint_deinit() - Will dismantle object and free the memory - fingerprint_start() - Returns start of fingerprint array's memory - fingerprint_nth() - Returns n'th element of fingerprint Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics. - When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments. (Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.) Test changes: Miscellaneous: - Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.) - Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces: - platform_apis_test: - splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly. - Enhanced various diagnostics, asserts, tracing - Improved memory usage stats gathering and reporting - Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list - Add diagram for large-free fragment tracking.

This commit reworks the interfaces on the lines discussed in this discussion thread: #615 void * platform_alloc(memfrag *mf, int size, ...); void * platform_realloc(memfrag *mf, int newsize); void platform_free(memfrag *mf); // IN Currently, the return from `platform_free` still remains as void. Changing it to platform_status will mean plumbing the return handling to all callers. Also C-system call `free()` is defined as `void`. So changing `platform_free` to return platform_status will be a bit inconsistent.

gapisback · 2024-02-01T02:45:43Z

@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set.

You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616 )

gapisback · 2024-02-01T21:33:47Z

@rtjohnso : Fyi -- I want to log this one ASAN-instability the most recent round of CI-jobs ran into, as I am not going to remember all this later.

Here is the state of affairs and results of my investigations.

CI Job no. 109 (main-pr-asan) job failed with this error:

 build/release-asan/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

build/release-asan/bin/driver_test: splinterdb_build_version 9281c83f

Dispatch test splinter_test

Attempt to create shared segment of size 8589934592 bytes.

Created shared memory of size 8589934592 bytes (8 GiB), shmid=8617984.

Completed setup of shared memory of size 8589934592 bytes (8 GiB), shmaddr=0x7f6924570000, shmid=8617984, available memory = 8589894272 bytes (~7.99 GiB).

filter-index-size: 256 is too small, setting to 512

Running splinter_test with 1 caches

splinter_test: SplinterDB performance test started with 1 tables

splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...

Thread 2 inserting  37% complete for table 0 ... =================================================================

==2666==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f68fba16f80 at pc 0x7f6b276fef50 bp 0x7f69044aa080 sp 0x7f69044a9828

READ of size 589 at 0x7f68fba16f80 thread T1

Thread 2 inserting  42% complete for table 0 ... OS-pid=2666, OS-tid=2669, Thread-ID=3, Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

./test.sh: line 115:  2666 Aborted                 "$@"

make: *** [Makefile:558: run-tests] Error 134

Upon re-run this asan job no. 109.1 succeeded.
Attempted to manually re-run this specific test multiple times on my Nimbus-VM,but could not reproduce the ASAN heap-buffer-overflow error. Ran the exact test with different combinations (one run, 4 concurrent runs with the exact same params, 4 concurrent executions with increasing thread-count up to --num-insert-threads 8 --num-lookup-threads 8, and similar stress load on the VM), but could not repro the problem outside CI.

The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)

./driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 8 --num-lookup-threads 8 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts.

NOTE: In the original failure in CI, hard to tell exactly, but it seems like the thread ID 2 ran into the ASAN memory over flow and soon after, thread ID=3 ran into this assertion a few lines later:

OS-pid=2666, OS-tid=2669, Thread-ID=3,  Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion.

I suspect that there is something lurking there that popped up in the CI-run.

I cannot explain how / whether / if this assertion tripping is caused by the ASAN heap-buffer-overflow error or if they are even related. Unfortunately, I could not repro the ASAN issue outside CI, so have to give up on this investigation now.

The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier.

rtjohnso

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I assume once we get these nailed down, then most of the changes in the rest of the code will be relatively straightforward updates to the new apis.

Or is there anything else major?

Let's get the new apis sorted and then I can review the whole PR.

rtjohnso · 2024-01-26T08:39:07Z

include/splinterdb/splinterdb.h

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
 // Close a splinterdb
 //
 // This will flush all data to disk and release all resources
-void
+int


Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

0 on success,

a positive integer when all data has been persisted but not all resources were able to be released, and

a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

rtjohnso · 2024-02-02T00:41:29Z

src/btree.c

@@ -3167,8 +3173,8 @@ btree_pack_loop(btree_pack_req *req,       // IN/OUT
   log_trace_key(tuple_key, "btree_pack_loop (bottom)");

   if (req->hash) {
-      platform_assert(req->num_tuples < req->max_tuples);
-      req->fingerprint_arr[req->num_tuples] =
+      platform_assert(btree_pack_can_fit_tuple(req));


rtjohnso · 2024-02-02T09:28:39Z

src/platform_linux/platform.c

   platform_histo_handle hh;
-   hh = TYPED_MANUAL_MALLOC(
+   hh = TYPED_ARRAY_MALLOC(


Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

rtjohnso · 2024-02-02T09:39:47Z

src/platform_linux/platform.h

-#define TYPED_ALIGNED_ZALLOC(hid, a, v, n)                                     \
+
+#define TYPED_ALIGNED_MALLOC(hid, a, v, n)                                     \
+   TYPED_ALIGNED_MALLOC_MF(&memfrag_##v, hid, a, v, n)


I hate this hard-coding of this naming convention, and I think it doesn't solve an important problem.

Let me propose something that might be more useful.

#define TYPED_ALIGNED_MALLOC_AUTO(hid, a, v, n) \ platform_memfrag v_##memfrag __attribute__(cleanup(platform_free)); TYPED_ALIGNED_MALLOC_MF(&v_##memfrag, hid, a, v, n)

This allocates memory that will automatically get freed when the function exits the scope of the allocation.

Furthermore, it alleviates the user of the responsibility to declare a memfrag at all.

rtjohnso · 2024-02-02T09:41:18Z

src/platform_linux/platform.h

+ * Utility macro to test if an argument to platform_free() is a
+ * platform_memfrag *.
+ */
+#define IS_MEM_FRAG(x)                                                         \


This will go away, right?

rtjohnso · 2024-02-02T09:48:44Z

src/platform_linux/platform.h

+                                       __FILE__,                               \
+                                       __LINE__);                              \
+      _mf->addr = NULL;                                                        \
+      _mf->size = 0;                                                           \
   } while (0)

 // Convenience function to free something volatile
 static inline void
 platform_free_volatile_from_heap(platform_heap_id heap_id,
                                 volatile void   *ptr,


This will change to accept a memfrag, right?

rtjohnso · 2024-02-02T09:49:23Z

src/platform_linux/platform_inline.h

@@ -40,10 +40,11 @@ platform_checksum_is_equal(checksum128 left, checksum128 right)
 static void
 platform_free_from_heap(platform_heap_id UNUSED_PARAM(heap_id),
                        void            *ptr,
+                        const size_t     size,


This will be modified to take a memfrag, right?

rtjohnso · 2024-02-02T09:49:56Z

src/platform_linux/platform.h

-      platform_free_volatile_from_heap(                                        \
-         id, (p), STRINGIFY(p), __func__, __FILE__, __LINE__);                 \
-      (p) = NULL;                                                              \
+      debug_assert(((p) != NULL),                                              \


Once you change platform_free_volatile_from_heap to take a memfrag, you won't need all this type checking.

rtjohnso · 2024-02-02T09:50:56Z

src/platform_linux/platform_inline.h

+   void *retptr = NULL;
+   if (heap_id == PROCESS_PRIVATE_HEAP_ID) {
+      retptr = aligned_alloc(alignment, required);
+      if (memfrag) {


Get rid of the if. Always require a memfrag parameter.

rtjohnso · 2024-02-02T09:51:45Z

src/platform_linux/platform_inline.h

 static inline void
 platform_free_from_heap(platform_heap_id heap_id,
                        void            *ptr,
+                        const size_t     size,


Take a memfrag instead.

gapisback · 2024-02-02T14:27:34Z

@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself to get lots of comments as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, then that will reduce a bunch of rework rounds on me.

Let's get the new apis sorted and then I can review the whole PR.

Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.)

Once I go over all the changes, I will be better able to answer this question of yours:

Or is there anything else major?

... for which the answer now is, I don't think so, off-hand.

rtjohnso · 2024-02-02T18:49:59Z

I left a few comments on the fingerprint array code already.

I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together.

rtjohnso · 2024-02-02T20:42:35Z

I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code.

So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer.

vmwclabot added the cla-not-required label Apr 14, 2023

gapisback commented Apr 14, 2023

View reviewed changes

gapisback commented Apr 15, 2023

View reviewed changes

gapisback mentioned this pull request Jan 24, 2024

shmem.c: Rename variables relating to large-fragment handling. #609

Merged

gapisback mentioned this pull request Jan 24, 2024

Refactor to save / print shmem usage stats using common struct. #610

Merged

gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 6c24747 to 33a95f2 Compare January 25, 2024 06:45

gapisback removed request for rosenhouse and ajhconway January 25, 2024 07:01

This was referenced Jan 25, 2024

Stabilization changes for platform_free() and free-list fragment management. #565

Closed

Rework fingerprint array as derived object from memfrag{} object. #564

Closed

gapisback mentioned this pull request Jan 31, 2024

Stabilization of part-3 of shared-memory support, based on review comments. #616

Draft

gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 3588f3e to 9281c83 Compare February 1, 2024 02:42

rtjohnso requested changes Feb 2, 2024

View reviewed changes

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Are you sure you want to change the base?

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Conversation

gapisback commented Apr 14, 2023 • edited Loading

Design Overview

Managing memory allocation / free using platform_memfrag{} fragments

Cautionary Note

Fingerprint Object Management

Test changes

netlify bot commented Apr 14, 2023 • edited Loading

✅ Deploy Preview for splinterdb canceled.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback Apr 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback Apr 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback commented Jan 25, 2024

rtjohnso commented Jan 27, 2024

gapisback commented Jan 27, 2024

gapisback commented Feb 1, 2024

gapisback commented Feb 1, 2024 • edited Loading

rtjohnso left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gapisback commented Feb 2, 2024

rtjohnso commented Feb 2, 2024

rtjohnso commented Feb 2, 2024

gapisback commented Apr 14, 2023 •

edited

Loading

Managing memory allocation / free using `platform_memfrag{}` fragments

netlify bot commented Apr 14, 2023 •

edited

Loading

gapisback Apr 14, 2023 •

edited

Loading

gapisback Apr 14, 2023 •

edited

Loading

gapisback commented Feb 1, 2024 •

edited

Loading