Assertion failure while running multi-thread concurrent inserts workload: trunk_split_leaf(): "(num_leaves + trunk_num_pivot_keys(spl, parent) <= spl->cfg.max_pivot_keys)" #467

Open
gapisback opened this issue Oct 28, 2022 · 1 comment
Collaborator

gapisback commented Oct 28, 2022

In the dev branch agurajada/467-large-inserts-trunk-assert-bug, off of /main, a new collection of multi-threaded, insert-heavy workloads is being developed. One of the cases runs into this assertion:

$ build/debug/bin/unit/large_inserts_bugs_stress_test --num-inserts 5000000 --num-threads 6 test_seq_key_fully_packed_value_inserts_threaded_same_start_keyid

[...]
exec_worker_thread()::489:Thread 6  inserts 5000000 (5 million), sequential key, fully-packed constant value, KV-pairs starting from 0 (0) ...
OS-pid=1842079, Thread-ID=6, Insert fully-packed fixed value of length=256 bytes.
Assertion failed at src/trunk.c:5461:trunk_split_leaf(): "(num_leaves + trunk_num_pivot_keys(spl, parent) <= spl->cfg.max_pivot_keys)". num_leaves=6, trunk_num_pivot_keys()=9, cfg.max_pivot_keys=14
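
The values printed in the assertion message can be checked by hand: 6 + 9 = 15, which exceeds the limit of 14. The sketch below restates that inequality on plain integers. `split_fits_in_parent` is a hypothetical helper written for illustration, not a SplinterDB function, and it says nothing about why the split produced that many leaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not SplinterDB code): mirrors the inequality from the
 * failed assertion in trunk_split_leaf(), using plain integers in place of
 * the trunk handle and parent node.
 */
static bool
split_fits_in_parent(uint64_t num_leaves,
                     uint64_t parent_pivot_keys,
                     uint64_t max_pivot_keys)
{
   assert(max_pivot_keys > 0);
   return num_leaves + parent_pivot_keys <= max_pivot_keys;
}
```

With the reported values, `split_fits_in_parent(6, 9, 14)` is false, which matches the failure: the new leaves would need one more pivot key than the parent is configured to hold.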

NOTE: Before you can repro this, you need to pull in the in-flight fix for issue #458; otherwise you will run into that assertion first. The dev branch where this repro has been constructed, agurajada/467-large-inserts-trunk-assert-bug, already pulls in that commit.

This test case was synthesized to reliably repro this problem that was seen during manual testing and test-dev.

The key-points needed for the repro are:

  • Insert a large number of rows using the --num-inserts arg. The repro works with upwards of 1-2 million inserts per thread.
  • Use more than a few threads. It sometimes repros with --num-threads 4 as well.
  • All threads start from the same key-ID of 0, so the workload is essentially doing duplicate-key inserts; this seems to be peculiar to how the issue repros. TEST_KEY_SIZE = 30, but it is not clear whether this specific value makes a difference.
  • A fully-packed constant value of length TEST_VALUE_SIZE = 256 was needed in order to reliably repro this assertion.
  • The repro is still a bit unreliable, so you will have to run the test a few times to hit the assertion.
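
The workload shape described in the list above can be sketched as follows. This is a minimal illustration, not the test's actual code: the helper names, the decimal key encoding, and the fill byte are all assumptions, and "fully-packed" is taken to mean that every byte of the 256-byte value buffer is used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TEST_KEY_SIZE   30  /* value quoted in the issue */
#define TEST_VALUE_SIZE 256 /* value quoted in the issue */

/*
 * Hypothetical: key-ID that a thread inserts on its i-th iteration.
 * With same_start set, every thread walks the same IDs 0..num_inserts-1,
 * so the threads insert duplicate keys concurrently.
 */
static uint64_t
key_id_for(uint64_t thread_idx, uint64_t i, uint64_t num_inserts, int same_start)
{
   uint64_t start = same_start ? 0 : thread_idx * num_inserts;
   return start + i;
}

/*
 * Hypothetical key encoding: the key-ID printed as a decimal string.
 * Returns the key length in bytes (excluding the terminating NUL).
 */
static size_t
make_seq_key(char *key, size_t key_size, uint64_t key_id)
{
   assert(key != NULL && key_size > 0);
   int n = snprintf(key, key_size, "%llu", (unsigned long long)key_id);
   assert(n > 0 && (size_t)n < key_size);
   return (size_t)n;
}

/* Fill the whole value buffer with one constant byte (assumed 'V'). */
static void
fill_packed_value(char *val, size_t val_size)
{
   assert(val != NULL && val_size > 0);
   memset(val, 'V', val_size);
}
```

In this sketch, thread 3 on iteration 7 produces key-ID 7 when all threads share a start key, and 15000007 when each thread gets its own range of 5 million IDs.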

See this internal slack thread where this issue was aired out on a private dev branch first, before repro'ing this off of /main.


The other thing about this test is this part of the configuration:

data->cfg = (splinterdb_config){.filename   = TEST_DB_NAME,
                                .cache_size = 256 * Mega,
                                .disk_size  = 40 * Giga,

I tried a 64MiB cache, and that sometimes works. Often, though, we run into an "unable to find a free buffer" error from clockcache.c, so to avoid those noise errors I settled on a 256MiB cache, which should still be small enough to induce lots of I/O to disk.
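
A rough back-of-the-envelope check suggests why a 256MiB cache should still force disk I/O. The sketch below only counts raw key and value bytes (30 + 256 per insert) and ignores all on-disk metadata and compaction overhead; the `MiB`/`GiB` macros and the helper name are defined here for illustration, assuming `Mega` and `Giga` in the test are binary units, as the "256MiB" wording suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed binary units, defined locally for this sketch. */
#define MiB (1024ULL * 1024ULL)
#define GiB (1024ULL * MiB)

/*
 * Hypothetical helper: raw bytes of key + value data that one thread
 * submits, ignoring any per-tuple or per-page overhead.
 */
static uint64_t
logical_bytes_per_thread(uint64_t num_inserts,
                         uint64_t key_size,
                         uint64_t value_size)
{
   assert(key_size + value_size > 0);
   return num_inserts * (key_size + value_size);
}
```

For 5 million inserts, one thread submits 1,430,000,000 bytes (about 1.33 GiB), a little over five times the 256MiB cache. Six such threads submit 8,580,000,000 bytes in total, which is still well under the 40 GiB disk size.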

Collaborator Author

gapisback commented Oct 28, 2022

@ajhconway - An update after re-running all test cases in this new test with different parameters:

  1. With --num-inserts 5000000 --num-threads 6 - all cases passed.
  2. With --num-inserts 10000000 --num-threads 8 - running ... test case test_seq_key_random_values_inserts failed with this assertion:
TEST 3/11 large_inserts_bugs_stress:test_seq_key_random_values_inserts Fingerprint size 29 too large, max value size is 5, setting to 27
fingerprint_size: 27
filter-index-size: 256 is too small, setting to 512
exec_worker_thread()::489:**Thread 0**  inserts 10000000 (10 million), sequential key, random value, KV-pairs starting from 0 (0) ...
OS-pid=1842247, Thread-ID=0, Insert random value of fixed-length=256 bytes.
Assertion failed at src/trunk.c:3602:trunk_inc_filter(): "filter->addr != 0".

Stack at this point of failure when running just this one test case under gdb is:

OS-pid=1842287, Thread-ID=0, Insert random value of fixed-length=256 bytes.
Assertion failed at src/trunk.c:3602:trunk_inc_filter(): "filter->addr != 0".

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349751616) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349751616) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737349751616) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737349751616, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7d15476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cfb7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055555555c266 in platform_assert_false (filename=0x7ffff7fa3f9b "src/trunk.c", linenumber=3602,
    functionname=0x7ffff7fa8710 <__FUNCTION__.59> "trunk_inc_filter",
    expr=0x7ffff7fa4582 "filter->addr != 0", message=0x7ffff7fa396a "") at src/platform_linux/platform.c:344
#6  0x00007ffff7f7b447 in trunk_inc_filter (spl=0x7fffe4f81040, filter=0x7fffeb14c2d2) at src/trunk.c:3602
#7  0x00007ffff7f7df46 in trunk_flush_into_bundle (spl=0x7fffe4f81040, parent=0x7ffff50a23e0,
    child=0x7ffff50a4d98, pdata=0x7fffeaf8f544, req=0x555555588f40) at src/trunk.c:4165
#8  0x00007ffff7f7e666 in trunk_flush (spl=0x7fffe4f81040, parent=0x7ffff50a23e0, pdata=0x7fffeaf8f544,
    is_space_rec=0) at src/trunk.c:4277
#9  0x00007ffff7f7ec5d in trunk_flush_fullest (spl=0x7fffe4f81040, node=0x7ffff50a23e0) at src/trunk.c:4359
#10 0x00007ffff7f7f98e in trunk_compact_bundle (arg=0x5555555d8700, scratch_buf=0x7ffff79e8ae8)
    at src/trunk.c:4685
#11 0x00007ffff7f97e20 in task_group_perform_one (group=0x7ffff79e7800) at src/task.c:687
#12 0x00007ffff7f97f8f in task_perform_one (ts=0x7ffff79e5040) at src/task.c:713
#13 0x00007ffff7f8577d in trunk_insert (spl=0x7fffe4f81040, key=0x7fffffffe180 "\a5925139", data=...)
    at src/trunk.c:6197
#14 0x00007ffff7f698fe in splinterdb_insert_message (kvs=0x55555556c700, key=..., msg=...)
    at src/splinterdb.c:710
#15 0x00007ffff7f699ac in splinterdb_insert (kvsb=0x55555556c700, key=..., value=...) at src/splinterdb.c:718
#16 0x000055555555ab82 in exec_worker_thread (w=0x7fffffffe4a0)

  3. For the same combination as above, 8 threads each inserting 10 million rows, all test cases passed when run individually.

This shows that there is no gross instability, but there may be lurking timing- or concurrency-related issues that surface only after repeated re-runs with different combinations of --num-inserts and --num-threads.
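
Since the failures only show up across repeated runs and parameter combinations, a small driver can help sweep them. The sketch below only builds the command line, using the binary path and the two flags quoted in this issue; `format_repro_cmd` is a hypothetical helper, and how the command is then executed (for example via `system()` in a loop) is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one repro command line into buf.
 * Returns the command length, or -1 if buf is too small.
 */
static int
format_repro_cmd(char         *buf,
                 size_t        len,
                 unsigned long num_inserts,
                 unsigned      num_threads,
                 const char   *test_name)
{
   assert(buf != NULL && len > 0);
   assert(test_name != NULL && strlen(test_name) > 0);
   int n = snprintf(buf,
                    len,
                    "build/debug/bin/unit/large_inserts_bugs_stress_test"
                    " --num-inserts %lu --num-threads %u %s",
                    num_inserts,
                    num_threads,
                    test_name);
   /* snprintf() reports the untruncated length; treat truncation as failure. */
   return (n > 0 && (size_t)n < len) ? n : -1;
}
```

A caller could loop over a few thread counts and insert counts, format each command, run it several times, and stop at the first non-zero exit status to capture the failing combination.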

gapisback added a commit that referenced this issue Oct 28, 2022
Rework large_inserts_bugs_stress test to generate assertion failure.

This commit slightly enhances do_inserts_n_threads() in this
test case to cajole an assertion seen from BTree split code, or
thereabouts:

OS-pid=1839020, Thread-ID=5, Assertion failed at src/trunk.c:5521:trunk_split_leaf(): "(num_leaves + trunk_num_pivot_keys(spl, parent) <= spl->cfg.max_pivot_keys)". num_leaves=6, trunk_num_pivot_keys()=9, cfg.max_pivot_keys=14

The changes are:
- Provide options to use same / diff start-key for each thread.
- Increase TEST_KEY_SIZE to 30 and TEST_VALUE_SIZE to 256 bytes.
- Provide an option to either generate sequential values or to
  use fully-packed values for each key. The latter seems to be
  the condition that triggers this assertion.

Many diff variations of test cases are provided in this one large
framework. See large_inserts_bugs_stress_test --list for names of
individual test cases.
gapisback added a commit that referenced this issue Dec 7, 2022
Rework large_inserts_bugs_stress test to generate assertion failure.
