
Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Open · wants to merge 2 commits into base: main

Conversation

@gapisback (Collaborator) commented Apr 14, 2023

The main change in this commit is support for free-fragment lists and recycling of small fragments from shared memory. This was a major limitation of the support added in previous commits.

Another driving factor for implementing free-fragment-list support was that earlier multi-user concurrent-insert performance benchmarking was not functional beyond a point: we would frequently run into shared-memory out-of-memory (OOM) failures, even with shmem sizes > 8 GiB (which had worked in a prior dev/perf-test cycle).

Design Overview

The main design changes to manage small fragments are as follows:

Managing memory allocation / free using platform_memfrag{} fragments

  • Allocation and freeing of memory are handled in terms of "memory fragments": a small structure that holds the memory's {addr, size}. All memory requests are, as before, aligned to the cacheline.

    • Allocation: All clients of memory allocation "hand in" an opaque platform_memfrag{} handle, which is returned populated with the memory address and, more importantly, the size of the fragment that was used to satisfy the memory request.

    • Free: Clients now have to safely keep a handle to this returned platform_memfrag{} and hand it back to the free() method. free() relies entirely on the size specified in this input fragment handle. The freed memory fragment is returned to the corresponding free-list bucket if its size is one of the small set of free-fragment sizes being tracked.

  • Upon free(), the freed fragment is tracked in one of a few free-lists bucketed by the fragment's size. For now, we support 4 buckets: size <= 64, <= 128, <= 256, and <= 512 bytes. (These sizes are sufficient for current benchmarking requirements.)

    A freed fragment is hung off of the corresponding list, threading the free fragments through the fragments' own memory.

  • A new struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and a free_frag_next pointer. The 'size' provided to the free() call is recorded as the freed fragment's size.

  • Subsequently, a new alloc() request is first satisfied by searching the free-list corresponding to the memory request.

For example, a request from a client for 150 bytes will be rounded up to a cacheline boundary,
i.e., 192 bytes. The free-list for the 256-byte bucket will be searched to find the first free fragment of the right size. If no free fragment is found in the target list, we allocate a new fragment. The returned fragment will have a size of 256 bytes (for an original request of 150 bytes).
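The round-up and bucket selection described in this example can be sketched as follows. This is a minimal model: the 64-byte cacheline constant and the helper names round_to_cacheline()/bucket_for() are assumptions for illustration, not SplinterDB's actual code.

```c
#include <stddef.h>

#define CACHELINE 64 /* assumed cache-line size (PLATFORM_CACHELINE_SIZE) */

/* Round a request up to the next cache-line boundary. */
static size_t
round_to_cacheline(size_t nbytes)
{
   return (nbytes + CACHELINE - 1) & ~((size_t)CACHELINE - 1);
}

/* Pick the smallest free-list bucket (64/128/256/512) that can hold the
 * aligned request; 0 means "too large for the small-fragment lists". */
static size_t
bucket_for(size_t aligned_size)
{
   static const size_t buckets[] = {64, 128, 256, 512};
   for (int i = 0; i < 4; i++) {
      if (aligned_size <= buckets[i]) {
         return buckets[i];
      }
   }
   return 0;
}
```

With these helpers, the 150-byte request in the example aligns to 192 bytes and lands in the 256-byte bucket.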

  • An immediate consequence of this approach is a small but significant change in the allocation and free APIs; i.e., TYPED_MALLOC(), TYPED_ARRAY_MALLOC(), and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return zeroed-out memory.

  • All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free().

  • In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method.

    • An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'.

    This pattern now appears in many init()/deinit() methods of different sub-systems; e.g., pcq_alloc(), pcq_free(), ...

  • Copious debug and platform asserts have been added in the shmem alloc/free methods to cross-check, to some extent, for illegal calls.
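Taken together, the client-side alloc/free handshake looks roughly like the sketch below. The platform_memfrag{} field names match the description above, but the helper signatures are simplified assumptions; the real code goes through the TYPED_*() macros and platform_free().

```c
#include <stdlib.h>

/* Simplified model of the platform_memfrag{} handshake: alloc populates
 * {addr, size}; the caller keeps the handle and hands it back to free,
 * which trusts the recorded size. Helper names/signatures are assumed. */
typedef struct platform_memfrag {
   void  *addr;
   size_t size;
} platform_memfrag;

static void *
memfrag_alloc(platform_memfrag *mf, size_t nbytes)
{
   size_t aligned = (nbytes + 63) & ~(size_t)63; /* cache-line round-up */
   mf->addr = aligned_alloc(64, aligned);
   mf->size = aligned; /* the size that must come back on free() */
   return mf->addr;
}

static void
memfrag_free(platform_memfrag *mf)
{
   /* Real code would return the fragment to the free-list bucket
    * matching mf->size; here we simply release it. */
   free(mf->addr);
   mf->addr = NULL;
   mf->size = 0;
}
```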

Cautionary Note

If the 'ptr' handed to platform_free() is not of type platform_memfrag{} *, it is treated as a generic struct pointer, and its sizeof() is used as the 'size' of the fragment to free.

This works in most cases, except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a larger fragment that just happened to be available in the free-list. The consequence is that we might end up freeing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end up recording the free fragment's size as smaller than it really is. Over time, this may add up to a small memory leak, but it hasn't been found to be crippling in current runs. (There is definitely no issue here with overwriting memory due to incorrect sizes.)
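A tiny worked example of this accounting gap, with illustrative numbers and a hypothetical small_struct (none of this is from the PR itself):

```c
#include <stddef.h>

/* Hypothetical 40-byte client structure, for illustration only. */
typedef struct small_struct {
   char payload[40];
} small_struct;

static size_t
round_to_cacheline(size_t n)
{
   return (n + 63) & ~(size_t)63; /* assumed 64-byte cacheline */
}

/* If the allocator served the request from a recycled, larger fragment,
 * a later sizeof()-based free records only the rounded sizeof(); the
 * difference is what the free-list tracker quietly loses over time. */
static size_t
bytes_under_recorded(size_t struct_size, size_t served_frag_size)
{
   return served_frag_size - round_to_cacheline(struct_size);
}
```

Here a 40-byte struct rounds to a 64-byte request; if a recycled 128-byte fragment served it, a sizeof()-based free under-records by 64 bytes.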

Fingerprint Object Management

Managing memory for fingerprint arrays was particularly problematic.

This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.

The requirements from the new interfaces are handled as follows:

  • Added a new fingerprint object, struct fp_hdr{}, which embeds a platform_memfrag{} at its head. A few other small fields are added for tracking fingerprint memory-management gyrations.

  • Various accessor methods are added to manage memory for fingerprint arrays through this object.
    E.g.,

    • fingerprint_init() - Allocates the required fingerprint memory for 'ntuples'.
    • fingerprint_deinit() - Dismantles the object and frees its memory.
    • fingerprint_start() - Returns the start of the fingerprint array's memory.
    • fingerprint_nth() - Returns the n'th element of the fingerprint array.

Packaging the handling of the fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics.
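A minimal model of the fingerprint object and its accessors, following the stated design (fp_hdr embedding a platform_memfrag{} at its head). The layout and signatures here are assumptions for illustration, not the PR's exact code:

```c
#include <stdlib.h>

typedef unsigned int uint32;

typedef struct platform_memfrag {
   void  *addr;
   size_t size;
} platform_memfrag;

/* fp_hdr embeds a platform_memfrag{} at its head, per the design. */
typedef struct fp_hdr {
   platform_memfrag mf;
   size_t           ntuples;
} fp_hdr;

static void
fingerprint_init(fp_hdr *fp, size_t ntuples)
{
   fp->mf.addr  = calloc(ntuples, sizeof(uint32)); /* zeroed array */
   fp->mf.size  = ntuples * sizeof(uint32);
   fp->ntuples  = ntuples;
}

/* OOM is detected via is_empty(), not by checking a raw array pointer. */
static int
fingerprint_is_empty(const fp_hdr *fp)
{
   return fp->mf.addr == NULL;
}

static uint32 *
fingerprint_start(fp_hdr *fp)
{
   return (uint32 *)fp->mf.addr;
}

static uint32 *
fingerprint_nth(fp_hdr *fp, size_t n)
{
   return fingerprint_start(fp) + n;
}

static void
fingerprint_deinit(fp_hdr *fp)
{
   /* Real code frees through the embedded memfrag's recorded size. */
   free(fp->mf.addr);
   fp->mf.addr = NULL;
   fp->mf.size = 0;
   fp->ntuples = 0;
}
```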

  • When SplinterDB is closed, the shared-memory dismantling routine tags any large fragments that are still found "in use". This is percolated all the way back to splinterdb_close(), unmount(), and platform_heap_destroy() as a failure $rc. Tests will fail if they have left some unfreed large fragments.
    (A similar approach was considered to book-keep all small fragments used/freed but, due to some rounding errors, it cannot be a reliable check at this time, so it hasn't been done.)

Test changes

Miscellaneous

  • Elaborate and illustrative tracing has been added to track the memory management done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (This was a very problematic debugging issue.)

  • Extended tests to exercise the core memory allocation / free APIs, fingerprint object management, and writable_buffer interfaces:

    • platform_apis_test:
    • splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly.
  • Enhanced various diagnostics, asserts, tracing

  • Improved memory usage stats gathering and reporting

  • Added hooks to cross-check multiple frees of fragments, and testing hooks to verify that a freed fragment is relocated to the right free-list

@netlify bot commented Apr 14, 2023

Deploy Preview for splinterdb canceled.

Latest commit: 9281c83
Latest deploy log: https://app.netlify.com/sites/splinterdb/deploys/65bb050c366a1500096d0cad

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
// Close a splinterdb
//
// This will flush all data to disk and release all resources
void
int
@gapisback (Collaborator Author) commented:

To percolate errors found by shm-destroy, if large fragments that were not freed are still found hanging around.

A Contributor commented:

Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

  • 0 on success,
  • a positive integer when all data has been persisted but not all resources were able to be released, and
  • a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

@@ -380,7 +380,6 @@ void PACKEDARRAY_JOIN(__PackedArray_unpack_, PACKEDARRAY_IMPL_BITS_PER_ITEM)(con
#include "poison.h"

#define PACKEDARRAY_MALLOC(size) platform_malloc(size)
#define PACKEDARRAY_FREE(p) platform_free(p)
@gapisback (Collaborator Author) commented:

Unused interface.

platform_assert(req->num_tuples < req->max_tuples);
req->fingerprint_arr[req->num_tuples] =
platform_assert(btree_pack_can_fit_tuple(req));
fingerprint_start(&req->fingerprint)[req->num_tuples] =
@gapisback (Collaborator Author) commented:

Here's where you will start to see the use of fingerprint object and its accessor / interfaces.

uint32 *fingerprint_arr; // IN/OUT: hashes of the keys in the tree
hash_fn hash; // hash function used for calculating filter_hash
unsigned int seed; // seed used for calculating filter_hash
fp_hdr fingerprint; // IN/OUT: hashes of the keys in the tree
@gapisback (Collaborator Author) commented:

The in-place char * array is now replaced by the fingerprint object, which carries inside it a platform_memfrag{} handle to track the allocated memory fragment's size and to free it reliably.

req->fingerprint_arr =
TYPED_ARRAY_ZALLOC(hid, req->fingerprint_arr, max_tuples);

fingerprint_init(&req->fingerprint, hid, max_tuples); // Allocates memory
@gapisback (Collaborator Author) commented:

Inline memory allocation on old L345 is, henceforth, replaced by init()'ing the fingerprint object ... And so on ...

"Unable to allocate memory for %lu tuples",
max_tuples);
if (!req->fingerprint_arr) {
if (fingerprint_is_empty(&req->fingerprint)) {
@gapisback (Collaborator Author) commented:

You can no longer check for NULL array ptr, to detect OOM. You must consult the is_empty() method to figure out if there is memory or not.

if (req->fingerprint_arr) {
platform_free(hid, req->fingerprint_arr);
if (!fingerprint_is_empty(&req->fingerprint)) {
fingerprint_deinit(hid, &req->fingerprint);
@gapisback (Collaborator Author) commented:

deinit() will free memory.

if (!cc->lookup) {
goto alloc_error;
}
cc->lookup_size = memfrag_size(&memfrag_cc_lookup);
@gapisback (Collaborator Author) commented Apr 14, 2023:

Here's the first instance of a pair of init / deinit calls, which now need to communicate the size of memory fragment allocated by init().

Like it's done on this line, a few common structures now gain a new size field to track the memory fragment's size. These structures are of the kind that are allocated / init'ed in one function, with the deinit() method called much later, in a separate function.

src/clockcache.c Outdated
if (cc->lookup) {
platform_free(cc->heap_id, cc->lookup);
memfrag_init_size(mf, cc->lookup, cc->lookup_size);
platform_free(cc->heap_id, mf);
@gapisback (Collaborator Author) commented:

free() needs to be told the fragment's size correctly. This is obtained from the size field stashed away when init() was done.

src/clockcache.c Outdated
}
if (cc->entry) {
platform_free(cc->heap_id, cc->entry);
memfrag_init_size(mf, cc->entry, cc->entry_size);
platform_free(cc->heap_id, mf);
@gapisback (Collaborator Author) commented:

Same pattern of changes continues. This will appear in many more instances ...

src/memtable.c Outdated
platform_memfrag memfrag_ctxt;
platform_memfrag *mf = &memfrag_ctxt;
memfrag_init_size(mf, ctxt, ctxt->mt_ctxt_size);
platform_free(hid, mf);
@gapisback (Collaborator Author) commented:

NOTE: This does look like verbose multi-line repeat code.

@rtjohnso - I did consider whether to add a packaged macro, say, memfrag_init(), to which you supply the addr / init. Inside the body, we can declare a hidden structures and platform_memfrag *mf, do the setup, and pass-it as the 2nd arg.

Can be done ... probably .. did not try it too hard. Wanted to get this into review, and then I expect anyway to get comments on this approach.

We can re-discuss the coding impact this approach has ... and re-visit during review.

@@ -287,7 +291,11 @@ io_handle_deinit(laio_handle *io)
}
platform_assert(status == 0);

platform_free(io->heap_id, io->req);
platform_memfrag memfrag = {.addr = io->req, .size = io->req_size};
@gapisback (Collaborator Author) commented:

NOTE to myself: This should go away. Currently, memfrag_init_size() is a #define and this struct is exposed in platform.h. Rework this so the fields are hidden, and only memfrag_init_size() is exposed to client code.

This will prevent such naked assignments.

@@ -84,13 +84,16 @@ platform_heap_create(platform_module_id UNUSED_PARAM(module_id),
return STATUS_OK;
}

void
platform_status
@gapisback (Collaborator Author) commented Apr 14, 2023:

Percolating errors upstream from platform_shmdestroy; see L91 below.

platform_histo_handle hh;
hh = TYPED_MANUAL_MALLOC(
hh = TYPED_ARRAY_MALLOC(
@gapisback (Collaborator Author) commented:

Equivalent calls.

A Contributor commented:

Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

({ \
debug_assert((n) >= sizeof(*(v))); \
(typeof(v))platform_aligned_malloc(hid, \
PLATFORM_CACHELINE_SIZE, \
(n), \
(mf), \
@gapisback (Collaborator Author) commented:

Allocation now receives and returns a platform_memfrag{} *, so the macro's call signature changes.

@@ -368,13 +371,13 @@ extern platform_heap_id Heap_id;
({ \
debug_assert((n) >= sizeof(*(v))); \
(typeof(v))platform_aligned_malloc( \
hid, (a), (n), STRINGIFY(v), __func__, __FILE__, __LINE__); \
hid, (a), (n), NULL, STRINGIFY(v), __func__, __FILE__, __LINE__); \
@gapisback (Collaborator Author) commented:

For most of the consumers, this is good-enough. I could have changed this to require all callers to also declare an on-stack platform_memfrag{}, but that would be more code changes.

The one 'minor' issue with this is that we might incorrectly free a smaller-sized fragment. But that's not a huge loss, so I went with the current solution.

"Attempt to free a NULL ptr from '%s', line=%d", \
__func__, \
__LINE__); \
if (IS_MEM_FRAG(p)) { \
@gapisback (Collaborator Author) commented:

This change is key ... and needs understanding. Please review carefully.

const size_t _reqd = \
(_size + platform_alignment(PLATFORM_CACHELINE_SIZE, _size)); \
platform_free_mem((hid), (p), _reqd, STRINGIFY(p)); \
(p) = NULL; \
@gapisback (Collaborator Author) commented:

@rtjohnso - This line and L907 below is what makes it necessary for clients calling free() to do two things:

   platform_memfrag memfrag;
   platform_memfrag *mf;
   // ... do the initialization ...
   platform_free(mf);

I would have liked to skip the mf and simply pass-in &memfrag to free(), but there is a compiler error.

I think this can be fixed with some rework ... but I ran out of energy. Let's review if this can be improved.

platform_do_realloc(const platform_heap_id heap_id,
const size_t oldsize,
void *ptr, // IN
size_t *newsize, // IN/OUT
@gapisback (Collaborator Author) commented:

Reallocation now returns the *newsize, so clients like writable buffer resize can record the new fragment's size in its buffer_capacity field.

This, then, allows writable_buffer_deinit() to correctly supply the newly realloc'ed fragment's size to free.
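The contract can be sketched as follows; do_realloc() is a simplified stand-in for platform_do_realloc(), and the 64-byte alignment constant is an assumption for illustration:

```c
#include <stdlib.h>

/* Sketch of the realloc contract described above: the allocator reports
 * the size of the fragment it actually used through *newsize, and the
 * caller (e.g., writable-buffer resize) records it as its capacity so a
 * later free can supply the right size. */
static void *
do_realloc(void *ptr, size_t oldsize, size_t *newsize)
{
   (void)oldsize; /* a real allocator would use this to recycle 'ptr' */
   size_t aligned = (*newsize + 63) & ~(size_t)63;
   void  *retptr  = realloc(ptr, aligned);
   if (retptr != NULL) {
      *newsize = aligned; /* caller stashes this as buffer_capacity */
   }
   return retptr;
}
```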

void *retptr = (heap_id ? splinter_shm_alloc(heap_id, required, objname)
: aligned_alloc(alignment, required));
void *retptr = NULL;
if (heap_id == PROCESS_PRIVATE_HEAP_ID) {
@gapisback (Collaborator Author) commented:

Clarified the semantics of heap_id usage: NULL means process-private heap ID, so go through the old malloc()-style code-flow.

int frag_allocated_to_pid; // Allocated to this OS-pid
int frag_freed_by_pid; // OS-pid that freed this fragment
threadid frag_freed_by_tid; // Splinter thread-ID that freed this
int frag_line;
@gapisback (Collaborator Author) commented:

No change; just indented and aligned fields for readability.

@gapisback (Collaborator Author) commented:

Updated: nothing changed; only indentation changes.

# define SHM_LARGE_FRAG_SIZE (90 * KiB)
#else
# define SHM_LARGE_FRAG_SIZE (38 * KiB)
#endif // SPLINTER_DEBUG
@gapisback (Collaborator Author) commented:

The difference in this limit was causing unit-tests to fail in debug builds and pass in release builds.

This artifact is leftover from my PoC-dev days; the limits should no longer need to differ between builds.

Have stabilized on 32 KiB as the lower limit for large fragments in both builds. Moved to shmem.h.

typedef struct free_frag_hdr {
struct free_frag_hdr *free_frag_next;
size_t free_frag_size;
} free_frag_hdr;
@gapisback (Collaborator Author) commented:

Used to chain freed fragments that are returned to the free-list. This tiny struct lives at the head of each free fragment.

The minimum fragment size is 64 bytes, so we have room.
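A sketch of how freed fragments can be threaded through their own memory using this header. free_frag_hdr{} matches the diff above; the push/pop helpers are illustrative stand-ins for the shmem free-list code, not the actual implementation:

```c
#include <stddef.h>

/* Header stored in-place at the head of each freed fragment. */
typedef struct free_frag_hdr {
   struct free_frag_hdr *free_frag_next;
   size_t                free_frag_size;
} free_frag_hdr;

/* Push a freed fragment at the head of its size-bucket free-list,
 * reusing the fragment's own memory for the links. */
static void
free_frag_push(free_frag_hdr **head, void *ptr, size_t size)
{
   ((free_frag_hdr *)ptr)->free_frag_next = *head;
   ((free_frag_hdr *)ptr)->free_frag_size = size;
   *head = (free_frag_hdr *)ptr;
}

/* Pop the first fragment from the list; NULL when the list is empty. */
static void *
free_frag_pop(free_frag_hdr **head)
{
   free_frag_hdr *frag = *head;
   if (frag != NULL) {
      *head = frag->free_frag_next;
   }
   return frag;
}
```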

* can print these after shared segment has been destroyed.
* ------------------------------------------------------------------------
*/
typedef struct shminfo_usage_stats {
@gapisback (Collaborator Author) commented:

Consolidated all metrics and usage-stats into this common struct.

It is updated in place, nested in the shmem control block, and is also used to return metrics when shared memory is being dismantled.

Works much better this way!

@@ -198,13 +309,13 @@ platform_shm_hip(platform_heap_id hid)
static inline void
shm_lock_mem_frags(shmem_info *shminfo)
{
platform_spin_lock(&shminfo->shm_mem_frags_lock);
platform_mutex_lock(&shminfo->shm_mem_frags_mutex);
@gapisback (Collaborator Author) commented:

A better solution. Otherwise, for some cases of the new large_inserts_stress_test, we were simply burning 100% CPU.

In all workloads, the pthread-semaphore cost is less than 5%, sometimes even 1-2%, as seen in perf top.

platform_save_usage_stats(shminfo_usage_stats *usage, shmem_info *shminfo)
{
*usage = shminfo->usage;
usage->large_frags_found_in_use = platform_trace_large_frags(shminfo);
@gapisback (Collaborator Author) commented:

Much simpler than the line-by-line copy that the old code from L303 onwards was doing.

} else {
// Try to satisfy small memory fragments based on requested size, from
// cached list of free-fragments.
retptr = platform_shm_find_frag(shminfo, size, objname, func, file, line);
@gapisback (Collaborator Author) commented:

A new find method to locate small free fragments through their size-bucketed free-lists.

{
((free_frag_hdr *)ptr)->free_frag_next = *here;
((free_frag_hdr *)ptr)->free_frag_size = size;
*here = ptr;
@gapisback (Collaborator Author) commented:

Inserting the freed fragment at the head of its free-list.

gapisback added a commit that referenced this pull request Dec 10, 2023
Upcoming PR #569 is overhauling large-inserts stress test.
To simplify examining the diffs of this test case as part
of that review, this commit is renaming the test file
to large_inserts_stress_test.c, with appropriate changes
to the build Makefile and test files, to pickup new file.
gapisback added a commit that referenced this pull request Jan 24, 2024
Upcoming PR #569 is bringing-in support for handling small
fragments. This commit renames existing variables, field names
and a few function names that deal with large-fragment support
to consistently use 'large' in the name. This clears the way
in the namespace for upcoming small-fragment changes.

Some examples:
- struct shm_frag_info -> struct shm_large_frag_info
- E.g., shm_frag_addr -> frag_addr, shm_frag_size -> frag_size ...
- shm_frag_info shm_mem_frags[] -> shm_large_frag_info shm_large_frags[]
- shm_num_frags_tracked -> shm_nlarge_frags_tracked
- platform_shm_find_free() -> platform_shm_find_large()

... No other code-/logic-changes are done with this commit.
gapisback added a commit that referenced this pull request Jan 24, 2024
This commit refactors shared memory usage stats fields to
drive-off shminfo_usage_stats{} struct entirely. Add
platform_save_usage_stats(), used by platform_shm_print_usage().

This refactoring paves the way for upcoming PR #569 which
is adding more memory-usage stats fields.
@gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 6c24747 to 33a95f2 on January 25, 2024 06:45
@gapisback (Collaborator Author) commented:

@rtjohnso - The final part-3 shared memory support change-set is now ready for review.

The suggested order in which to review these diffs is:

  1. src/platform_linux/platform.h
  2. src/platform_linux/platform_inline.h
  3. src/platform_linux/platform_types.h
  4. src/util.h, src/util.c
  5. src/platform_linux/shmem.h
  6. src/platform_linux/shmem.c
  7. src/trunk.h, src/trunk.c
  8. src/routing_filter.c
  9. Then the rest of the files.
  10. Good luck!

@rtjohnso (Contributor) commented:

I think the current memfrag interface is leaky and not general.

I think the interface should look like this:

platform_status
platform_alloc(memfrag *mf, // OUT
               int size);

platform_status
platform_realloc(memfrag *mf, // IN/OUT
                 int newsize);

platform_status
platform_free(memfrag *mf); // IN

void *
memfrag_get_pointer(memfrag *mf);

(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.)

The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road.

As for names, I would advocate renaming memfrag to memory_allocation.

@gapisback (Collaborator Author) commented:

Hi, @rtjohnso --

Thanks for your initial approach on reworking the interfaces.

I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction.

I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times.

How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces.

(As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.)

…ject mgmt

The main change with this commit is the support for free-fragment lists
and recycling of small fragments from shared memory. This was a main
limitation of the support added in previous commits. Another driving
factor for implementing some free-list support was that previous
multi-user concurrent insert performance benchmarking was not functional
beyond a point, and we'd frequently run into shmem Out-Of-Memory (OOMs),
even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).

The main design changes to manage small-fragments are follows:

Managing memory allocation / free using platform_memfrag{} fragments:

- Allocation and free of memory is dealt with in terms of "memory
  fragments", a small structure that holds the memory->{addr, size}.
  All memory requests (as is being done previously) are aligned to
  the cacheline.

  - Allocation: All clients of memory allocation have to "hand-in"
    an opaque platform_memfrag{} handle, which will be returned populated
    with the memory address, and more importantly, the size-of-the-fragment
    that was used to satisfy the memory request.

  - Free: Clients now have to safely keep a handle to this returned
    platform_memfrag{}, and hand it back to the free() method.
    free() will rely "totally" on the size specified in this input
    fragment handle supplied. And the free'd memory fragment will
    be returned to the corresponding free-list bucket.

- Upon free(), the freed-fragment is tracked in a few free-lists
  bucketed by size of the freed-fragment. For now, we support 4 buckets,
  size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient
  for current benchmarking requirements.) A free'd fragment is hung off
  of the corresponding list, threading the free-fragments using
  the fragment's memory itself. New struct free_frag_hdr{} provides the
  threading structure. It tracks the current fragment's size and
  free_frag_next pointer. The 'size' provided to the free() call is
  is recorded as the free'd fragment's size.

- Subsequently, a new alloc() request is first satisfied by searching
  the free-list corresponding to the memory request. For example, a
  request from a client for 150 bytes will be rounded up to a cacheline
  boundary, i.e. 192 bytes. The free-list for the 256-byte bucket will
  be searched to find the first free fragment of the right size. If no
  free fragment is found in the target list, we then allocate a new
  fragment. The returned fragment will have a size of 256 (for an
  original request of 150 bytes).

- An immediate consequence of this approach is a small but significant
  change to the allocation / free APIs; i.e. TYPED_MALLOC(),
  TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z'
  equivalents, which return zeroed-out memory.

- All existing clients of the various TYPED*() memory allocation
  calls have been updated to declare an on-stack platform_memfrag{}
  handle, which is passed back to platform_free().

  - In some places memory is allocated to initialize sub-systems and
    then torn down during deinit(). In a few places existing structures
    are extended to track an additional 'size' field. The size of the
    memory fragment allocated during init() is recorded here, and then
    used to invoke platform_free() as part of the deinit() method.
    An example is clockcache_init(), where this kind of work to record
    the 'size' of the fragment is done and passed down to
    clockcache_deinit(), where the memory fragment is then freed with
    the right 'size'. This pattern now appears in many such
    init()/deinit() methods of different sub-systems; e.g. pcq_alloc(),
    pcq_free(), ...

- Cautionary Note:

  If the 'ptr' handed to platform_free() is not of type
  platform_memfrag *, it is treated as a generic <struct> *, and its
  sizeof() will be used as the 'size' of the fragment to free. This
  works in most cases, except for some lapsed cases where, when
  allocating a structure, the allocator ended up selecting a "larger"
  fragment that just happened to be available in the free-list. The
  consequence is that we might end up freeing a larger fragment to a
  smaller-sized free-list. Or, even if we do free it to the right-sized
  bucket, we still end up marking the free-fragment's size as smaller
  than what it really is. Over time, this may add up to a small memory
  leak, but it hasn't been found to be crippling in current runs.
  (There is definitely no issue here with over-writing memory due to
  incorrect sizes.)

- Copious debug and platform asserts have been added in the shmem
  alloc/free methods to cross-check, to some extent, for illegal calls.

Fingerprint Object Management:

  Managing memory for fingerprint arrays was particularly problematic.
  This was the case even in a previous commit, before the introduction
  of the memfrag{} approach. Managing fingerprint memory was found to
  be especially cantankerous due to the way filter-building and compaction
  tasks are queued and asynchronously processed by some other
  thread / process.

  The requirements from the new interfaces are handled as follows:

   - Added a new fingerprint{} object, struct fp_hdr{}, which embeds
     a platform_memfrag{} at its head. A few other short fields are
     added for tracking fingerprint memory-mgmt gyrations.

   - Various accessor methods are added to manage memory for fingerprint
     arrays through this object. E.g.,

     - fingerprint_init()   - Allocates the fingerprint memory required
                              for 'ntuples'.
     - fingerprint_deinit() - Dismantles the object and frees its memory.
     - fingerprint_start()  - Returns the start of the fingerprint
                              array's memory.
     - fingerprint_nth()    - Returns the n'th element of the
                              fingerprint array.

  Packaging the handling of fingerprint array through this object and
  its interfaces helped greatly to stabilize the memory histrionics.

- When SplinterDB is closed, the shared-memory dismantling routine will
  tag any large fragments that are still found "in-use". This is
  percolated all the way back to splinterdb_close(), unmount() and
  platform_heap_destroy() as a failure $rc. Tests will fail if they have
  left some un-freed large fragments. (A similar approach was considered
  to book-keep all small fragments used/freed, but due to some rounding
  errors, it cannot be a reliable check at this time, so it hasn't been
  done.)

Test changes and miscellaneous:
 - Elaborate and illustrative tracing added to track memory mgmt done
   for fingerprint arrays, especially when they are bounced around
   queued / re-queued tasks. (Was a very problematic debugging issue.)
 - Extended tests to exercise core memory allocation / free APIs, and
   to exercise fingerprint object mgmt, and writable_buffer interfaces:
    - platform_apis_test:
    - splinter_shmem_test.c: Adds specific test-cases to verify that
        free-list mgmt is happening correctly.
 - Enhanced various diagnostics, asserts, tracing
 - Improved memory usage stats gathering and reporting
 - Added hooks to cross-check multiple-frees of fragments, and testing
   hooks to verify if a free'd fragment is relocated to the right free-list
 - Added a diagram for large-free fragment tracking.

This commit also reworks the interfaces along the lines discussed in
discussion thread #615:

void *
platform_alloc(memfrag *mf, int size, ...);

void *
platform_realloc(memfrag *mf, int newsize);

void
platform_free(memfrag *mf); // IN

Currently, the return from `platform_free` still remains as void.
Changing it to platform_status will mean plumbing the return handling
to all callers. Also C-system call `free()` is defined as `void`.
So changing `platform_free` to return platform_status will be a
bit inconsistent.
@gapisback gapisback force-pushed the agurajada/shmem-free-list-mgmt-Rev branch from 3588f3e to 9281c83 on February 1, 2024 02:42
@gapisback (Collaborator Author)

@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set.

You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616.)

@gapisback (Collaborator Author) commented Feb 1, 2024

@rtjohnso : Fyi -- I want to log this one ASAN instability that the most recent round of CI-jobs ran into, as I am not going to remember all this later.

Here is the state of affairs and results of my investigations.

  1. CI Job no. 109 (main-pr-asan) job failed with this error:
 build/release-asan/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

build/release-asan/bin/driver_test: splinterdb_build_version 9281c83f

Dispatch test splinter_test

Attempt to create shared segment of size 8589934592 bytes.

Created shared memory of size 8589934592 bytes (8 GiB), shmid=8617984.

Completed setup of shared memory of size 8589934592 bytes (8 GiB), shmaddr=0x7f6924570000, shmid=8617984, available memory = 8589894272 bytes (~7.99 GiB).

filter-index-size: 256 is too small, setting to 512

Running splinter_test with 1 caches

splinter_test: SplinterDB performance test started with 1 tables

splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...

Thread 2 inserting  37% complete for table 0 ... =================================================================

==2666==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f68fba16f80 at pc 0x7f6b276fef50 bp 0x7f69044aa080 sp 0x7f69044a9828

READ of size 589 at 0x7f68fba16f80 thread T1

Thread 2 inserting  42% complete for table 0 ... OS-pid=2666, OS-tid=2669, Thread-ID=3, Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

./test.sh: line 115:  2666 Aborted                 "$@"

make: *** [Makefile:558: run-tests] Error 134
  2. Upon re-run, this ASAN job (run no. 109.1) succeeded.

  3. Attempted to manually re-run this specific test multiple times on my Nimbus-VM, but could not reproduce the ASAN heap-buffer-overflow error. Ran the exact test with different combinations (one run, 4 concurrent runs with the exact same params, 4 concurrent executions with increasing thread-count up to --num-insert-threads 8 --num-lookup-threads 8, and similar stress load on the VM), but could not repro the problem outside CI.

The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)

./driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 8 --num-lookup-threads 8 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts.


NOTE: In the original failure in CI, it is hard to tell exactly, but it seems like thread ID 2 ran into the ASAN memory overflow, and soon after, thread ID 3 ran into this assertion a few lines later:

OS-pid=2666, OS-tid=2669, Thread-ID=3,  Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion.

I suspect that there is something lurking there that popped up in the CI-run.

I cannot explain whether this assertion tripping is caused by the ASAN heap-buffer-overflow error, or if the two are even related. Unfortunately, I could not repro the ASAN issue outside CI, so I have to give up on this investigation for now.

The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier.

@rtjohnso (Contributor) left a comment:

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I assume once we get these nailed down, then most of the changes in the rest of the code will be relatively straightforward updates to the new apis.

Or is there anything else major?

Let's get the new apis sorted and then I can review the whole PR.

@@ -179,7 +179,7 @@ splinterdb_open(splinterdb_config *cfg, splinterdb **kvs);
// Close a splinterdb
//
// This will flush all data to disk and release all resources
void
int

Good change. Can you add a comment defining the meaning of the return value? e.g.

"returns 0 on success, non-zero otherwise."

Or

"returns

  • 0 on success,
  • a positive integer when all data has been persisted but not all resources were able to be released, and
  • a negative number to indicate that not all data was able to be persisted and the database was unable to shut down safely."

@@ -3167,8 +3173,8 @@ btree_pack_loop(btree_pack_req *req, // IN/OUT
log_trace_key(tuple_key, "btree_pack_loop (bottom)");

if (req->hash) {
platform_assert(req->num_tuples < req->max_tuples);
req->fingerprint_arr[req->num_tuples] =
platform_assert(btree_pack_can_fit_tuple(req));

Nice

platform_histo_handle hh;
hh = TYPED_MANUAL_MALLOC(
hh = TYPED_ARRAY_MALLOC(

Actually, I believe the correct macro for this situation is TYPED_FLEXIBLE_STRUCT_ZALLOC.

#define TYPED_ALIGNED_ZALLOC(hid, a, v, n) \

#define TYPED_ALIGNED_MALLOC(hid, a, v, n) \
TYPED_ALIGNED_MALLOC_MF(&memfrag_##v, hid, a, v, n)

I hate this hard-coding of this naming convention, and I think it doesn't solve an important problem.

Let me propose something that might be more useful.

#define TYPED_ALIGNED_MALLOC_AUTO(hid, a, v, n)                                \
  platform_memfrag v_##memfrag __attribute__((cleanup(platform_free)));        \
  TYPED_ALIGNED_MALLOC_MF(&v_##memfrag, hid, a, v, n)

This allocates memory that will automatically get freed when the function exits the scope of the allocation.

Furthermore, it alleviates the user of the responsibility to declare a memfrag at all.

* Utility macro to test if an argument to platform_free() is a
* platform_memfrag *.
*/
#define IS_MEM_FRAG(x) \

This will go away, right?

__FILE__, \
__LINE__); \
_mf->addr = NULL; \
_mf->size = 0; \
} while (0)

// Convenience function to free something volatile
static inline void
platform_free_volatile_from_heap(platform_heap_id heap_id,
volatile void *ptr,

This will change to accept a memfrag, right?

@@ -40,10 +40,11 @@ platform_checksum_is_equal(checksum128 left, checksum128 right)
static void
platform_free_from_heap(platform_heap_id UNUSED_PARAM(heap_id),
void *ptr,
const size_t size,

This will be modified to take a memfrag, right?

platform_free_volatile_from_heap( \
id, (p), STRINGIFY(p), __func__, __FILE__, __LINE__); \
(p) = NULL; \
debug_assert(((p) != NULL), \

Once you change platform_free_volatile_from_heap to take a memfrag, you won't need all this type checking.

void *retptr = NULL;
if (heap_id == PROCESS_PRIVATE_HEAP_ID) {
retptr = aligned_alloc(alignment, required);
if (memfrag) {

Get rid of the if. Always require a memfrag parameter.

static inline void
platform_free_from_heap(platform_heap_id heap_id,
void *ptr,
const size_t size,

Take a memfrag instead.

@gapisback (Collaborator Author)

@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself for lots of comments, as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, that will spare me a bunch of rework rounds.

Let's get the new apis sorted and then I can review the whole PR.

Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.)

Once I go over all the changes, I will be better able to answer this question of yours:

Or is there anything else major?

... for which the answer now is, I don't think so, off-hand.

@rtjohnso (Contributor) commented Feb 2, 2024

I left a few comments on the fingerprint array code already.

I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together.

@rtjohnso (Contributor) commented Feb 2, 2024

I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code.

So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer.
