Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helgrind failures on concurrent critnib write and read #905

Open
PatKamin opened this issue Nov 18, 2024 · 1 comment
Open

Helgrind failures on concurrent critnib write and read #905

PatKamin opened this issue Nov 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@PatKamin
Copy link
Contributor

Helgrind runs fail randomly on concurrent critnib write and read. Possibly related to the issue #886.

Examples of failed jobs:
https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534
https://github.com/oneapi-src/unified-memory-framework/actions/runs/11719465368/job/32644634889

Environment Information

  • UMF version (hash commit or a tag): main
  • OS(es) version(s): ubuntu-22.04
  • kernel version(s): 6.5.0
  • compiler, libraries, and other related tools version(s): cc 11.4

Please provide a reproduction of the bug:

Although failures seem to occur randomly throughout different PRs from time to time, there is one branch were the issue occurres always. Here is the branch that might be useful for debugging: https://github.com/PatKamin/unified-memory-framework/tree/helgrind-bug

Actual behavior:

==5747== ---Thread-Announcement------------------------------------------
==5747== 
==5747== Thread #1 is the program's root thread
==5747== 
==5747== ----------------------------------------------------------------
==5747== 
==5747== Thread #1: pthread_rwlock_{rd,rw}lock with a pthread_mutex_t* argument 
==5747==    at 0x4852514: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747==    by 0x15B0E7: std::__glibcxx_rwlock_wrlock(pthread_rwlock_t*) (shared_mutex:80)
==5747==    by 0x15B13B: std::__shared_mutex_pthread::lock() (shared_mutex:193)
==5747==    by 0x15B1E7: std::shared_timed_mutex::lock() (shared_mutex:473)
==5747==    by 0x15C13F: std::lock_guard<std::shared_timed_mutex>::lock_guard(std::shared_timed_mutex&) (std_mutex.h:229)
==5747==    by 0x157846: Slab::regSlabByAddr(void*, Slab&) (pool_disjoint.cpp:535)
==5747==    by 0x157AD4: Slab::regSlab(Slab&) (pool_disjoint.cpp:563)
==5747==    by 0x157348: Slab::Slab(Bucket&) (pool_disjoint.cpp:463)
==5747==    by 0x15C[371](https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534#step:8:372): std::_MakeUniq<Slab>::__single_object std::make_unique<Slab, Bucket&>(Bucket&) (unique_ptr.h:962)
==5747==    by 0x157E1C: Bucket::getAvailFullSlab(bool&) (pool_disjoint.cpp:617)
==5747==    by 0x157F59: Bucket::getSlab(bool&) (pool_disjoint.cpp:631)
==5747==    by 0x1590E5: DisjointPool::AllocImpl::allocate(unsigned long, bool&) (pool_disjoint.cpp:853)
==5747== 
{
   <insert_a_suppression_name_here>
   Helgrind:Misc
   obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
   fun:_ZStL23__glibcxx_rwlock_wrlockP16pthread_rwlock_t
   fun:_ZNSt22__shared_mutex_pthread4lockEv
   fun:_ZNSt18shared_timed_mutex4lockEv
   fun:_ZNSt10lock_guardISt18shared_timed_mutexEC1ERS0_
   fun:_ZN4Slab13regSlabByAddrEPvRS_
   fun:_ZN4Slab7regSlabERS_
   fun:_ZN4SlabC1ER6Bucket
   fun:_ZSt11make_uniqueI4SlabJR6BucketEENSt9_MakeUniqIT_E15__single_objectEDpOT0_
   fun:_ZN6Bucket16getAvailFullSlabERb
   fun:_ZN6Bucket7getSlabERb
   fun:_ZN12DisjointPool9AllocImpl8allocateEmRb
}
==5747== ----------------------------------------------------------------
==5747== 
==5747== Thread #1: pthread_rwlock_unlock with a pthread_mutex_t* argument 
==5747==    at 0x4852CD6: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747==    by 0x15B116: std::__glibcxx_rwlock_unlock(pthread_rwlock_t*) (shared_mutex:82)
==5747==    by 0x15B16D: std::__shared_mutex_pthread::unlock() (shared_mutex:213)
==5747==    by 0x15B207: std::shared_timed_mutex::unlock() (shared_mutex:475)
==5747==    by 0x15C162: std::lock_guard<std::shared_timed_mutex>::~lock_guard() (std_mutex.h:235)
==5747==    by 0x15787C: Slab::regSlabByAddr(void*, Slab&) (pool_disjoint.cpp:537)
==5747==    by 0x157AD4: Slab::regSlab(Slab&) (pool_disjoint.cpp:563)
==5747==    by 0x157348: Slab::Slab(Bucket&) (pool_disjoint.cpp:463)
==5747==    by 0x15C371: std::_MakeUniq<Slab>::__single_object std::make_unique<Slab, Bucket&>(Bucket&) (unique_ptr.h:962)
==5747==    by 0x157E1C: Bucket::getAvailFullSlab(bool&) (pool_disjoint.cpp:617)
==5747==    by 0x157F59: Bucket::getSlab(bool&) (pool_disjoint.cpp:631)
==5747==    by 0x1590E5: DisjointPool::AllocImpl::allocate(unsigned long, bool&) (pool_disjoint.cpp:853)
==5747== 
{
   <insert_a_suppression_name_here>
   Helgrind:Misc
   obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
   fun:_ZStL23__glibcxx_rwlock_unlockP16pthread_rwlock_t
   fun:_ZNSt22__shared_mutex_pthread6unlockEv
   fun:_ZNSt18shared_timed_mutex6unlockEv
   fun:_ZNSt10lock_guardISt18shared_timed_mutexED1Ev
   fun:_ZN4Slab13regSlabByAddrEPvRS_
   fun:_ZN4Slab7regSlabERS_
   fun:_ZN4SlabC1ER6Bucket
   fun:_ZSt11make_uniqueI4SlabJR6BucketEENSt9_MakeUniqIT_E15__single_objectEDpOT0_
   fun:_ZN6Bucket16getAvailFullSlabERb
   fun:_ZN6Bucket7getSlabERb
   fun:_ZN12DisjointPool9AllocImpl8allocateEmRb
}
==5747== ----------------------------------------------------------------
==5747== 
==5747== Thread #1: pthread_rwlock_{rd,rw}lock with a pthread_mutex_t* argument 
==5747==    at 0x4852704: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747==    by 0x15B0B8: std::__glibcxx_rwlock_rdlock(pthread_rwlock_t*) (shared_mutex:78)
==5747==    by 0x15B18F: std::__shared_mutex_pthread::lock_shared() (shared_mutex:229)
==5747==    by 0x15B227: std::shared_timed_mutex::lock_shared() (shared_mutex:489)
==5747==    by 0x15C6D8: std::shared_lock<std::shared_timed_mutex>::shared_lock(std::shared_timed_mutex&) (shared_mutex:727)
==5747==    by 0x15965B: DisjointPool::AllocImpl::deallocate(void*, bool&) (pool_disjoint.cpp:960)
==5747==    by 0x159E95: DisjointPool::free(void*) (pool_disjoint.cpp:1081)
==5747==    by 0x15DB31: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_ (cpp_helpers.hpp:77)
==5747==    by 0x15FEFA: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_ (cpp_helpers.hpp:77)
==5747==    by 0x162B84: umfPoolFree (memory_pool.c:171)
==5747==    by 0x1549E9: free_memory(umf_memory_pool_t*, void**) (malloc_compliance_tests.cpp:40)
==5747==    by 0x154F03: malloc_compliance_test(umf_memory_pool_t*) (malloc_compliance_tests.cpp:76)
==5747== 
{
   <insert_a_suppression_name_here>
   Helgrind:Misc
   obj:/usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so
   fun:_ZStL23__glibcxx_rwlock_rdlockP16pthread_rwlock_t
   fun:_ZNSt22__shared_mutex_pthread11lock_sharedEv
   fun:_ZNSt18shared_timed_mutex11lock_sharedEv
   fun:_ZNSt11shared_lockISt18shared_timed_mutexEC1ERS0_
   fun:_ZN12DisjointPool9AllocImpl10deallocateEPvRb
   fun:_ZN12DisjointPool4freeEPv
   fun:_ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_
   fun:_ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_
   fun:umfPoolFree
   fun:_ZL11free_memoryP17umf_memory_pool_tPPv
   fun:_Z22malloc_compliance_testP17umf_memory_pool_t
}

Helgrind: hg_main.c:308 (lockN_acquire_reader): Assertion 'lk->kind == LK_rdwr' failed.

host stacktrace:
==5747==    at 0x58026E7A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x58026FA7: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x5802714B: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x58006519: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x5800E5CA: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x5800EB6A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x58038[374](https://github.com/oneapi-src/unified-memory-framework/actions/runs/11481262621/job/31999571534#step:8:375): ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x5807F46A: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)
==5747==    by 0x580C9E70: ??? (in /usr/libexec/valgrind/helgrind-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 5747)
==5747==    at 0x48527CF: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==5747==    by 0x15B0B8: std::__glibcxx_rwlock_rdlock(pthread_rwlock_t*) (shared_mutex:78)
==5747==    by 0x15B18F: std::__shared_mutex_pthread::lock_shared() (shared_mutex:229)
==5747==    by 0x15B227: std::shared_timed_mutex::lock_shared() (shared_mutex:489)
==5747==    by 0x15C6D8: std::shared_lock<std::shared_timed_mutex>::shared_lock(std::shared_timed_mutex&) (shared_mutex:727)
==5747==    by 0x15965B: DisjointPool::AllocImpl::deallocate(void*, bool&) (pool_disjoint.cpp:960)
==5747==    by 0x159E95: DisjointPool::free(void*) (pool_disjoint.cpp:1081)
==5747==    by 0x15DB31: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENKUlPvDpT_E5_clIJS4_EEEDaS4_S6_ (cpp_helpers.hpp:77)
==5747==    by 0x15FEFA: _ZZN3umf6detail11poolOpsBaseI12DisjointPoolEE21umf_memory_pool_ops_tvENUlPvDpT_E5_4_FUNIJS4_EEEDTcldtdeLKPKS7_0EonclIJS6_EEscOS4_fp_spscOS5_fp0_EES4_S6_ (cpp_helpers.hpp:77)
==5747==    by 0x162B84: umfPoolFree (memory_pool.c:171)
==5747==    by 0x1549E9: free_memory(umf_memory_pool_t*, void**) (malloc_compliance_tests.cpp:40)
==5747==    by 0x154F03: malloc_compliance_test(umf_memory_pool_t*) (malloc_compliance_tests.cpp:76)
==5747==    by 0x11E661: umfPoolTest_malloc_compliance_Test::TestBody() (poolFixtures.hpp:439)
==5747==    by 0x1BBABE: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2638)
==5747==    by 0x1B3DBC: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2674)
==5747==    by 0x18D3FB: testing::Test::Run() (gtest.cc:2713)
==5747==    by 0x18DE8B: testing::TestInfo::Run() (gtest.cc:2859)
==5747==    by 0x18E829: testing::TestSuite::Run() (gtest.cc:3037)
==5747==    by 0x19F19D: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5967)
==5747==    by 0x1BCBE1: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2638)
==5747==    by 0x1B5062: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2674)
==5747==    by 0x19D7AB: testing::UnitTest::Run() (gtest.cc:5546)
==5747==    by 0x162258: RUN_ALL_TESTS() (gtest.h:2334)
==5747==    by 0x1621D1: main (gtest_main.cc:64)
client stack range: [0x1FFEFFA000 0x1FFF000FFF] client SP: 0x1FFEFFE6F0
valgrind stack range: [0x100328E000 0x100338DFFF] top usage: 18984 of 1048576

How often bug is revealed:

often, in some cases always
(always, often, rare)

Details

Most probably a proper valgrind instrumentation needs to be added to critnib.

Additional information about Priority and Help Requested:

Are you willing to submit a pull request with a proposed change? (Yes, No) No

Requested priority: (Showstopper, High, Medium, Low) Medium

@PatKamin PatKamin added the bug Something isn't working label Nov 18, 2024
@lplewa
Copy link
Contributor

lplewa commented Nov 19, 2024

This will be fixed probably by #898, so probably there is no need to debug this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants