L0 provider cannot find L0 symbols when dlopen is used. #926

Open
igchor opened this issue Nov 22, 2024 · 10 comments

@igchor
Member

igchor commented Nov 22, 2024

Problematic scenario:
The L0 UR adapter uses the L0 provider from UMF. The L0 UR adapter is dlopened by the loader (which the application links to). When the application itself also links with UMF and uses umfPoolGetMemoryProvider (or perhaps any UMF symbol?), the L0 provider cannot find the L0 symbols.

How to reproduce:

git clone https://github.com/igchor/umf_repro
cd umf_repro
mkdir build
cd build
# uncomment include_directories and link_directories from CMakeLists.txt and change them to point to a proper directory
cmake ..
make
LD_DEBUG=files LD_LIBRARY_PATH=. ./test

Output:

...

   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemAllocHost (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemAllocDevice (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemAllocShared (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemFree (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemGetIpcHandle (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemPutIpcHandle (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemOpenIpcHandle (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemCloseIpcHandle (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeContextMakeMemoryResident (fatal)
   1159394:     /home/igchor/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeDeviceGetProperties (fatal)
ERROR: umfMemoryProviderCreate
Segmentation fault (core dumped)

If I remove the call to umfPoolByPtr from main.c then the binary works (allocates memory).

igchor added the bug (Something isn't working) label Nov 22, 2024
@vinser52
Contributor

I was able to reproduce it. Investigating.

@vinser52 vinser52 self-assigned this Nov 27, 2024
@vinser52
Contributor

vinser52 commented Dec 9, 2024

I have found the root cause of the issue.

The reproducer does the following:

  1. The test executable is linked with libumf.so and loads the libipc.so library via dlopen with the flags RTLD_LAZY | RTLD_LOCAL.
  2. The libipc.so library is linked with libumf.so and libze_loader.so.
  3. The libipc.so library loads successfully; the test finds the alloc() function in libipc.so using dlsym and calls it.
  4. The alloc() function creates a Level Zero memory pool using the UMF API. When it calls the umfMemoryProviderCreate function, the UMF Level Zero provider internally calls init_ze_global_state to find the Level Zero symbols. According to the LD_DEBUG=symbols output, the dynamic loader searches the following binaries:
   2499740:     symbol=_dl_find_dso_for_object;  lookup in file=/user/svinogra/experiments/umf_repro/build/test [0]
   2499740:     symbol=_dl_find_dso_for_object;  lookup in file=/user/svinogra/unified-memory-framework/build/lib/libumf.so.0 [0]
   2499740:     symbol=_dl_find_dso_for_object;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
   2499740:     symbol=_dl_find_dso_for_object;  lookup in file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libhwloc.so.15 [0]
   2499740:     symbol=_dl_find_dso_for_object;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/user/svinogra/experiments/umf_repro/build/test [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/user/svinogra/unified-memory-framework/build/lib/libumf.so.0 [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libhwloc.so.15 [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   2499740:     symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
   2499740:     /user/svinogra/unified-memory-framework/build/lib/libumf.so.0: error: symbol lookup error: undefined symbol: zeMemAllocHost (fatal)

If we change the test executable to load libipc.so with the RTLD_GLOBAL flag, everything works as expected, and the LD_DEBUG=symbols logs are the following:

   2500341:     symbol=_dl_find_dso_for_object;  lookup in file=/user/svinogra/experiments/umf_repro/build/test [0]
   2500341:     symbol=_dl_find_dso_for_object;  lookup in file=/user/svinogra/unified-memory-framework/build/lib/libumf.so.0 [0]
   2500341:     symbol=_dl_find_dso_for_object;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
   2500341:     symbol=_dl_find_dso_for_object;  lookup in file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libhwloc.so.15 [0]
   2500341:     symbol=_dl_find_dso_for_object;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/user/svinogra/experiments/umf_repro/build/test [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/user/svinogra/unified-memory-framework/build/lib/libumf.so.0 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/opt/intel/oneapi/tbb/2021.13/env/../lib/intel64/gcc4.8/libhwloc.so.15 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=./libipc.so [0]
   2500341:     symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0]

@bratpiorka
Contributor

This is very strange. Call to init_level_zero from lib.c for sure loads ze_loader.so - we would not be able to create L0 context otherwise. So we know that ze_loader.so is loaded and used before calling umfMemoryProviderCreate. What is the difference in symbol initialization between init_level_zero and in umfMemoryProviderCreate?

@vinser52
Contributor

vinser52 commented Dec 9, 2024

> This is very strange. Call to init_level_zero from lib.c for sure loads ze_loader.so - we would not be able to create L0 context otherwise. So we know that ze_loader.so is loaded and used before calling umfMemoryProviderCreate. What is the difference in symbol initialization between init_level_zero and in umfMemoryProviderCreate?

My understanding is the following:

  1. lib.c is linked with libze_loader.so. When lib.c is loaded, the loader also loads the libze_loader.so library into the process address space, and init_level_zero uses symbols from libze_loader.so. That part works fine.
  2. When the umfMemoryProviderCreate function is called, UMF uses dlsym to search for Level Zero symbols (e.g. zeMemAllocHost, etc.). Here is a quote from the dlopen man page:
       RTLD_GLOBAL
              The symbols defined by this shared object will be made
              available for symbol resolution of subsequently loaded
              shared objects.

       RTLD_LOCAL
              This is the converse of RTLD_GLOBAL, and the default if
              neither flag is specified.  Symbols defined in this shared
              object are not made available to resolve references in
              subsequently loaded shared objects.

Since lib.c is loaded with the RTLD_LOCAL flag, its symbols and the symbols of its dependencies (libze_loader.so) are not globally visible. UMF calls dlsym with the RTLD_DEFAULT handle. Here is the corresponding quote from the dlsym man page:

RTLD_DEFAULT
              Find the first occurrence of the desired symbol using the
              default shared object search order.  The search will
              include global symbols in the executable and its
              dependencies, as well as symbols in shared objects that
              were dynamically loaded with the RTLD_GLOBAL flag.

So, as we can see, with RTLD_DEFAULT the search covers dynamically loaded objects only if they were loaded with the RTLD_GLOBAL flag.
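The difference between the two lookup scopes can be sketched in C as follows. This is a minimal illustration, not code from UMF or the reproducer: the helper name compare_lookups and its parameters are hypothetical, and it only uses dlopen and dlsym as documented in the man pages quoted above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes RTLD_DEFAULT only when this is defined */
#endif
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: load `lib` with RTLD_LOCAL, then look `sym` up twice.
 * *via_default receives dlsym(RTLD_DEFAULT, sym): global scope only, so
 *              objects loaded with RTLD_LOCAL are not searched.
 * *via_handle  receives dlsym(handle, sym): searches the loaded object and
 *              its dependencies, regardless of RTLD_LOCAL.
 * Returns 0 on success, -1 if the library could not be opened. */
int compare_lookups(const char *lib, const char *sym,
                    void **via_default, void **via_handle) {
    void *handle = dlopen(lib, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return -1;
    }
    *via_default = dlsym(RTLD_DEFAULT, sym);
    *via_handle = dlsym(handle, sym);
    /* The handle is intentionally kept open so the addresses stay valid. */
    return 0;
}
```

In the scenario from this issue, `lib` would correspond to libipc.so and `sym` to zeMemAllocHost: the handle-based lookup can reach libze_loader.so through the dependency chain, while the RTLD_DEFAULT lookup cannot. On glibc older than 2.34 the program has to be linked with -ldl.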

@vinser52
Contributor

vinser52 commented Dec 9, 2024

@igchor Could you please clarify how it maps to the level zero adapter implementation and its v2 version?

@igchor
Member Author

igchor commented Dec 9, 2024

@vinser52 all adapters are loaded with RTLD_LOCAL currently: https://github.com/oneapi-src/unified-runtime/blob/d3b81bfc88cc896b16634a5c602422a3aff5f4d1/source/common/linux/ur_lib_loader.cpp#L38 We could discuss changing this - I don't know what would be the exact impact.

Also, @vinser52 do you know why the reproducer only fails if there is a call to umf function in main.c? If I remove umfPoolByPtr call everything works fine, even with RTLD_LOCAL.

@vinser52
Contributor

vinser52 commented Dec 9, 2024

> @vinser52 all adapters are loaded with RTLD_LOCAL currently: https://github.com/oneapi-src/unified-runtime/blob/d3b81bfc88cc896b16634a5c602422a3aff5f4d1/source/common/linux/ur_lib_loader.cpp#L38 We could discuss changing this - I don't know what would be the exact impact.

@igchor But you said that the issue is only relevant for v2 of the L0 adapter. Do you have any idea what the difference is between v1 and v2 of the L0 adapter in that regard?

> Also, @vinser52 do you know why the reproducer only fails if there is a call to umf function in main.c? If I remove umfPoolByPtr call everything works fine, even with RTLD_LOCAL.

hmm, I forgot about that case. This is what I see in the LD_DEBUG=symbols,file logs:

  1. With RTLD_LOCAL and the umfPoolByPtr call.
symbol=zeMemAllocHost;  lookup in file=./test [0]
symbol=zeMemAllocHost;  lookup in file=/home/vinser52/repos/unified-memory-framework/build/umf_install/lib/libumf.so.0 [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
symbol=zeMemAllocHost;  lookup in file=/opt/intel/oneapi/tcm/1.2/lib/libhwloc.so.15 [0]
symbol=zeMemAllocHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  2. With RTLD_GLOBAL and the umfPoolByPtr call.
symbol=zeMemAllocHost;  lookup in file=./test [0]
symbol=zeMemAllocHost;  lookup in file=/home/vinser52/repos/unified-memory-framework/build/umf_install/lib/libumf.so.0 [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
symbol=zeMemAllocHost;  lookup in file=/opt/intel/oneapi/tcm/1.2/lib/libhwloc.so.15 [0]
symbol=zeMemAllocHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
symbol=zeMemAllocHost;  lookup in file=./libipc.so [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0]
  3. With RTLD_LOCAL but without the umfPoolByPtr call.
symbol=zeMemAllocHost;  lookup in file=./test [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
symbol=zeMemAllocHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
symbol=zeMemAllocHost;  lookup in file=./libipc.so [0]
symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0]

file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0];  needed by /home/vinser52/repos/unified-memory-framework/build/umf_install/lib/libumf.so.0 [0] (relocation dependency)

So when the umfPoolByPtr call is removed, the test no longer depends on (links with) libumf.so. The linker is smart enough to drop a library that is not actually needed, even if it is specified in CMakeLists.txt. Here is the ldd output:

ldd ./test
	linux-vdso.so.1 (0x00007ffc3db8f000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000741877600000)
	/lib64/ld-linux-x86-64.so.2 (0x0000741877932000)

In the 1st and 2nd cases, the libumf.so is loaded as a dependency of the test. In the 3rd case, the libumf.so is loaded as a dependency of the libipc.so.

But I do not understand these lines in the LD_DEBUG=symbols,file logs when the umfPoolByPtr call is removed:

symbol=zeMemAllocHost;  lookup in file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0]

file=/lib/x86_64-linux-gnu/libze_loader.so.1 [0];  needed by /home/vinser52/repos/unified-memory-framework/build/umf_install/lib/libumf.so.0 [0] (relocation dependency)

Will continue investigation.

@bratpiorka
Contributor

The reasons we load symbols in UMF this way are:

  1. assumption that if one wants to create an L0 provider and pass the context as a parameter, the context must already have been created using the L0 API, so ze_loader should already be present
  2. assumption that the ze_loader symbols would also be usable by UMF

Here the 1st assumption is correct, but the 2nd is not. What I would like to suggest is to look at /proc/self/maps and check for the path to ze_loader.
If it is there but the symbols are unavailable to us, then we could simply open it using dlopen(LOCAL). This way we avoid searching for this .so ourselves and the potential symbol mismatch problem.
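A rough sketch of the /proc/self/maps check suggested here could look like the C helper below. It is only an illustration under stated assumptions: the function name find_mapped_path is hypothetical, it is Linux-specific, and it relies on the pathname being the only field of a maps line that contains a '/' character.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan /proc/self/maps for the first mapped file whose
 * path contains `needle`. On a match, copy the path into `out` and return 0;
 * return -1 if nothing matches or the file cannot be read. */
int find_mapped_path(const char *needle, char *out, size_t out_size) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        return -1;
    }
    char line[4096];
    int found = -1;
    while (fgets(line, sizeof line, maps) != NULL) {
        /* Address, permissions, offset, device and inode never contain '/'. */
        char *path = strchr(line, '/');
        if (path == NULL) {
            continue; /* anonymous mapping, [heap], [stack], ... */
        }
        path[strcspn(path, "\n")] = '\0'; /* strip the trailing newline */
        if (strstr(path, needle) != NULL && strlen(path) < out_size) {
            strcpy(out, path);
            found = 0;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

With a needle such as "libze_loader.so", the returned path could then be passed to dlopen with RTLD_LOCAL, which is the second half of the suggestion above.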

@vinser52
Contributor

> If it is there but the symbols are unavailable to us, then we could simply open it using dlopen(LOCAL). This way we avoid searching for this .so ourselves and the potential symbol mismatch problem.

I had a similar idea yesterday in mind, but it was too late. Will check it today.

@vinser52
Contributor

I just checked the following:

Changed the init_ze_global_state to open the libze_loader.so using utils_open_library and pass handle to the libze_loader.so to the utils_get_symbol_addr function. With such a patch, the reproducer works fine.
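The shape of that change can be sketched in plain C as below. This is not the actual UMF patch: the signatures of utils_open_library and utils_get_symbol_addr are not shown in this thread, so the sketch calls dlopen and dlsym directly, and the helper name load_symbols is hypothetical.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: open `lib_name` explicitly and resolve `count`
 * symbols through the returned handle instead of through RTLD_DEFAULT.
 * Returns the handle on success, or NULL if the library or any of the
 * symbols is missing (the library is closed again in that case). */
void *load_symbols(const char *lib_name, const char *const names[],
                   void *addrs[], size_t count) {
    void *handle = dlopen(lib_name, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        addrs[i] = dlsym(handle, names[i]);
        if (addrs[i] == NULL) {
            dlclose(handle);
            return NULL;
        }
    }
    return handle;
}
```

For the Level Zero provider, `lib_name` would be the loader library seen in the logs above (libze_loader.so.1) and `names` the symbols from the error output, such as zeMemAllocHost and zeMemFree. Because the lookup goes through the handle, it no longer depends on whether the adapter was loaded with RTLD_LOCAL or RTLD_GLOBAL.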
