OSHMEM/MCA/SSHMEM/UCX: DEVICE_NIC_MEM hint - implementation should use RDMA memory type #11866
roiedanino commented Aug 21, 2023
- Updated the DEVICE_NIC_MEM implementation to use ucp_mem_map with the new RDMA memory type.
- Cleaned up old code that used the UCT API.
Hello! The Git Commit Checker CI bot found a few problems with this PR: 33a0760: Removed device mem
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed 33a0760 to 946a131.
Force-pushed 5dffffe to e853c37.
oshmem/mca/sshmem/ucx/configure.m4
Outdated
#include <ucp/api/ucp.h>
]],
[[
ucs_memory_type_t mem_type = UCS_MEMORY_TYPE_RDMA;
can just AC_CHECK_DECLS([UCS_MEMORY_TYPE_RDMA], ...)
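For context on this suggestion: AC_CHECK_DECLS defines HAVE_DECL_<SYMBOL> to 1 or 0 in every case, so C code should test it with #if rather than #ifdef. A minimal sketch of the consuming side follows. The fallback #define exists only so the snippet compiles without a configure run, and nic_mem_compiled_in() is a hypothetical helper, not part of the PR.

```c
/* Normally generated by configure through
 * AC_CHECK_DECLS([UCS_MEMORY_TYPE_RDMA], ...).
 * Assumption: defined here only so the sketch builds standalone. */
#ifndef HAVE_DECL_UCS_MEMORY_TYPE_RDMA
#define HAVE_DECL_UCS_MEMORY_TYPE_RDMA 0
#endif

/* Hypothetical helper: reports whether the RDMA memory type path
 * was compiled in. The macro is always defined (to 0 or 1), so the
 * check uses #if, not #ifdef. */
static int nic_mem_compiled_in(void)
{
#if HAVE_DECL_UCS_MEMORY_TYPE_RDMA
    return 1;
#else
    return 0;
#endif
}
```

With #ifdef the RDMA branch would be compiled even when the declaration is missing, because the macro would still be defined (to 0).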
@@ -176,19 +171,39 @@ segment_create_internal(map_segment_t *ds_buf, void *address, size_t size,
    return rc;
}

static int
segment_create_host_mem(map_segment_t *ds_buf, size_t size, unsigned flags) {
IMO we don't need this wrapper
status = segment_create_internal(ds_buf, NULL, size, flags, UCS_MEMORY_TYPE_RDMA);
if (status == OSHMEM_SUCCESS) {
    ds_buf->alloc_hints = hint;
    if (hint) {
Why is this needed?
}
return segment_create_internal(ds_buf, mca_sshmem_base_start_address, size, flags, UCS_MEMORY_TYPE_HOST);
line too long
    uct_ib_md_release_device_mem(dev_mem);
    /* fallback to regular allocation */
}
status = segment_create_internal(ds_buf, NULL, size, flags, UCS_MEMORY_TYPE_RDMA);
line seems too long
Hello! The Git Commit Checker CI bot found a few problems with this PR: f1671c2: Fixed CR - using DECLS auto generated macros
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed f1671c2 to 4d16347.
Force-pushed 4d16347 to 64e931b.
pls squash
Force-pushed 64e931b to 2858a93.
@gleon99 can you pls take a look as well?
#if HAVE_UCX_DEVICE_MEM
int ret = OSHMEM_ERROR;
#if HAVE_DECL_UCS_MEMORY_TYPE_RDMA
Please add an indicative debug message for the user, in case a hint was passed but not applied in the end. It can happen in 3 scenarios:
- NIC_MEM passed but the #if is false.
- NIC_MEM passed but create_internal failed -> fallback to host memory.
- Unknown / invalid "hint" was passed.
Hello! The Git Commit Checker CI bot found a few problems with this PR: 2fe0dcd: MCA/SSHMEM/UCX: Added warning messages for fallbac...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed 5456bf4 to e37a850.
    /* Fallback - Try again using host memory*/
}
#else
SSHMEM_WARN("UCX version is too old and won't support "
Are you sure? It would be printed regardless of the hint, even if the hint is 0.
@@ -189,8 +189,18 @@ segment_create(map_segment_t *ds_buf,
    ds_buf->allocator = &sshmem_ucx_allocator;
    return OSHMEM_SUCCESS;
}
else if(hint) {
This includes the possibility of hint == SHMEM_HINT_DEVICE_NIC_MEM. What's the reasoning?
SSHMEM_WARN("UCX version is too old and won't support "
            "SHMEM_HINT_DEVICE_NIC_MEM hint, fallback to host memory");
#else
SSHMEM_WARN("UCX is not supporting SHMEM_HINT_DEVICE_NIC_MEM "
does not support
SSHMEM_VERBOSE(1, "DEVICE_NIC_MEM hint ignored since UCX does not support MEMORY_TYPE_RDMA")
static int
segment_create(map_segment_t *ds_buf,
               const char *file_name,
               size_t size, long hint)
{
    mca_spml_ucx_t *spml = (mca_spml_ucx_t*)mca_spml.self;
    unsigned flags;
    unsigned flags = UCP_MEM_MAP_ALLOCATE;
    int status = OSHMEM_SUCCESS;
Seems this init value is not used in any scenario?
}
if (hint & ~SHMEM_HINT_DEVICE_NIC_MEM) {
    SSHMEM_WARN("Hint was not recognized therefore ignored, "
- Better to use "unknown/invalid" instead of "was not recognized".
- Print the passed hint and the "allowed" hints.
IMO we don't need to print any error in this case
Hello! The Git Commit Checker CI bot found a few problems with this PR: 1f8f040: SSHMEM/UCX: fixed logging
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
opal/mca/common/ucx/common_ucx.h
Outdated
MCA_COMMON_UCX_VERBOSE(1, "%s failed: %d, %s", (_msg) ? (_msg) : __func__, \
                       UCS_PTR_STATUS(_request), \
                       ucs_status_string(UCS_PTR_STATUS(_request))); \
MCA_COMMON_UCX_VERBOSE(1, "%s failed2: %d, %s", (_msg) ? (_msg) : __func__,\
leftover
Force-pushed 1f8f040 to ce61ab5.
@roiedanino next time pls don't force push
status = ucp_mem_map(spml->ucp_context, &mem_map_params, &mem_h);
if (UCS_OK != status) {
    SSHMEM_ERROR("ucp_mem_map() failed: %s\n", ucs_status_string(status));
    SSHMEM_ERROR("Failed to allocate DEVICE_NIC_MEM"
I'm not sure if this error msg is indicative.
In this particular context, (in theory) it might not necessarily be only about DEVICE_NIC_MEM...
SSHMEM_ERROR("Failed to allocate DEVICE_NIC_MEM"
SSHMEM_ERROR("ucp_mem_map() failed: %s %s\n",
             ucs_status_string(status),
             mem_type == UCS_MEMORY_TYPE_RDMA ?
             "failed to allocate DEVICE_NIC_MEM, falls back to host memory" : "");
Maybe that way?
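One detail of the suggestion above: with the format "%s %s", the message ends in a stray space whenever the memory type is not RDMA. A possible variant keeps the separator inside the optional suffix. The sketch below is standalone and illustrative only: format_map_error() is a hypothetical helper, the is_rdma flag stands in for the mem_type == UCS_MEMORY_TYPE_RDMA comparison, and snprintf stands in for SSHMEM_ERROR.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds the failure message into buf.
 * The DEVICE_NIC_MEM note, including its leading separator, is
 * appended only for the RDMA case, so the host case has no
 * trailing space. Returns the snprintf result. */
static int format_map_error(char *buf, size_t len,
                            const char *status_str, int is_rdma)
{
    return snprintf(buf, len, "ucp_mem_map() failed: %s%s", status_str,
                    is_rdma ? " (failed to allocate DEVICE_NIC_MEM,"
                              " falling back to host memory)"
                            : "");
}
```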
Force-pushed eab79be to 1577cec.
} else {
    SSHMEM_WARN("Failed to allocate DEVICE_NIC_MEM: %s, "
                "fallback to host memory", ucs_status_string(status));
remove this
@@ -123,7 +123,9 @@ segment_create_internal(map_segment_t *ds_buf, void *address, size_t size,

status = ucp_mem_map(spml->ucp_context, &mem_map_params, &mem_h);
if (UCS_OK != status) {
    SSHMEM_ERROR("ucp_mem_map() failed: %s\n", ucs_status_string(status));
    SSHMEM_VERBOSE(err_level, "ucp_mem_map(%s) failed: %s\n",
memory_type=%s
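A rough sketch of what the requested format could look like together with the caller-supplied verbosity level from the hunk above. Everything here is a stand-in: mem_type_t and mem_type_name() replace the UCX memory type and its name lookup, the explicit verbosity argument replaces the framework's configured verbosity, and fprintf replaces SSHMEM_VERBOSE.

```c
#include <stdio.h>

typedef enum { MEM_TYPE_HOST, MEM_TYPE_RDMA } mem_type_t; /* stand-in enum */

/* Hypothetical name lookup for the stand-in enum. */
static const char *mem_type_name(mem_type_t t)
{
    return t == MEM_TYPE_RDMA ? "rdma" : "host";
}

/* Prints the failure only when the configured verbosity reaches the
 * level chosen by the caller, so an expected failure (an RDMA attempt
 * that will fall back to host memory) can be logged more quietly than
 * a fatal one. Returns 1 if a message was printed, 0 otherwise. */
static int log_map_failure(int verbosity, int level, mem_type_t t,
                           const char *status_str)
{
    if (verbosity < level) {
        return 0;
    }
    fprintf(stderr, "ucp_mem_map(memory_type=%s) failed: %s\n",
            mem_type_name(t), status_str);
    return 1;
}
```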
pls squash
Force-pushed 0f18386 to 7aa1f28.
if (status == OSHMEM_SUCCESS) {
    ds_buf->alloc_hints = hint;
    ds_buf->allocator = &sshmem_ucx_allocator;
    return OSHMEM_SUCCESS;
}
#else
SSHMEM_VERBOSE(1, "DEVICE_NIC_MEM hint ignored since UCX does not "
SSHMEM_VERBOSE(20, "DEVICE_NIC_MEM hint ignored since UCX does not "
use level 3
return OSHMEM_ERR_BAD_PARAM;
#if HAVE_DECL_UCS_MEMORY_TYPE_RDMA
status = segment_create_internal(ds_buf, NULL, size, flags,
                                 UCS_MEMORY_TYPE_RDMA, 20);
use level 3 instead of 20
type Signed-off-by: Roie Danino <[email protected]> Added a fallback for rdma allocation failure - allocating host memory instead Signed-off-by: Roie Danino <[email protected]>
Force-pushed 6131002 to b192a78.
@yosefe can we merge this one?