Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSHMEM/MCA/SSHMEM/UCX: DEVICE_NIC_MEM hint - implementation should use RDMA memory type #11866

Merged
merged 1 commit into from
Oct 8, 2023

Conversation

roiedanino
Copy link
Contributor

  • Updated the implementation for using the DEVICE_NIC_MEM to use ucp_mem_map with the new RDMA memory type.
  • Cleanup for old code which used uct api

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

33a0760: Removed device mem

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@roiedanino roiedanino force-pushed the minor-fixes-for-memic-support branch from 33a0760 to 946a131 Compare August 21, 2023 08:17
oshmem/mca/sshmem/ucx/sshmem_ucx_module.c Outdated Show resolved Hide resolved
oshmem/mca/sshmem/ucx/sshmem_ucx_module.c Outdated Show resolved Hide resolved
oshmem/mca/sshmem/ucx/sshmem_ucx.h Outdated Show resolved Hide resolved
@roiedanino roiedanino force-pushed the minor-fixes-for-memic-support branch from 5dffffe to e853c37 Compare August 21, 2023 11:17
#include <ucp/api/ucp.h>
]],
[[
ucs_memory_type_t mem_type = UCS_MEMORY_TYPE_RDMA;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can just AC_CHECK_DECLS([UCS_MEMORY_TYPE_RDMA], ...)

@@ -176,19 +171,39 @@ segment_create_internal(map_segment_t *ds_buf, void *address, size_t size,
return rc;
}

static int
segment_create_host_mem(map_segment_t *ds_buf, size_t size, unsigned flags) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we don't need this wrapper

status = segment_create_internal(ds_buf, NULL, size, flags, UCS_MEMORY_TYPE_RDMA);
if (status == OSHMEM_SUCCESS) {
ds_buf->alloc_hints = hint;
if (hint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

oshmem/mca/sshmem/ucx/configure.m4 Outdated Show resolved Hide resolved
oshmem/mca/sshmem/ucx/configure.m4 Outdated Show resolved Hide resolved
oshmem/mca/sshmem/ucx/configure.m4 Show resolved Hide resolved
}
return segment_create_internal(ds_buf, mca_sshmem_base_start_address, size, flags, UCS_MEMORY_TYPE_HOST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line too long

uct_ib_md_release_device_mem(dev_mem);
/* fallback to regular allocation */
}
status = segment_create_internal(ds_buf, NULL, size, flags, UCS_MEMORY_TYPE_RDMA);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line seems too long

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

f1671c2: Fixed CR - using DECLS auto generated macros

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls squash

@roiedanino roiedanino force-pushed the minor-fixes-for-memic-support branch from 64e931b to 2858a93 Compare August 23, 2023 08:22
Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gleon99 can you pls take a look as well?


#if HAVE_UCX_DEVICE_MEM
int ret = OSHMEM_ERROR;
#if HAVE_DECL_UCS_MEMORY_TYPE_RDMA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an indicative debug message for the user, in case one had passed a hint which was not applied eventually. It can happen in 3 scenarios:

  1. NIC_MEM passed but the #if is false.
  2. NIC_MEM passed but create_internal failed -> fallback to host memory..
  3. Unknown / invalid "hint" was passed.

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

2fe0dcd: MCA/SSHMEM/UCX: Added warning messages for fallbac...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@roiedanino roiedanino force-pushed the minor-fixes-for-memic-support branch 2 times, most recently from 5456bf4 to e37a850 Compare August 24, 2023 09:09
/* Fallback - Try again using host memory*/
}
#else
SSHMEM_WARN("UCX version is too old and won't support "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? It would be printed independent of the hint, even if 0..

@@ -189,8 +189,18 @@ segment_create(map_segment_t *ds_buf,
ds_buf->allocator = &sshmem_ucx_allocator;
return OSHMEM_SUCCESS;
}
else if(hint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If includes the possibility of hint == SHMEM_HINT_DEVICE_NIC_MEM. What's the reasoning?

SSHMEM_WARN("UCX version is too old and won't support "
"SHMEM_HINT_DEVICE_NIC_MEM hint, fallback to host memory");
#else
SSHMEM_WARN("UCX is not supporting SHMEM_HINT_DEVICE_NIC_MEM "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not support

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSHMEM_VERBOSE(1, "DEVICE_NIC_MEM hint ignored since UCX does not support MEMORY_TYPE_RDMA")

static int
segment_create(map_segment_t *ds_buf,
const char *file_name,
size_t size, long hint)
{
mca_spml_ucx_t *spml = (mca_spml_ucx_t*)mca_spml.self;
unsigned flags;
unsigned flags = UCP_MEM_MAP_ALLOCATE;
int status = OSHMEM_SUCCESS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this init value is not used in any scenario?

}
if (hint & ~SHMEM_HINT_DEVICE_NIC_MEM) {
SSHMEM_WARN("Hint was not recognized therefore ignored, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Better used unknown/invalid instead of "was not recognized.
  2. Print the passed hint and the "allowed" hints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we don't need to print any error in this case

SSHMEM_WARN("UCX version is too old and won't support "
"SHMEM_HINT_DEVICE_NIC_MEM hint, fallback to host memory");
#else
SSHMEM_WARN("UCX is not supporting SHMEM_HINT_DEVICE_NIC_MEM "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSHMEM_VERBOSE(1, "DEVICE_NIC_MEM hint ignored since UCX does not support MEMORY_TYPE_RDMA")

oshmem/mca/sshmem/ucx/sshmem_ucx_module.c Outdated Show resolved Hide resolved
}
if (hint & ~SHMEM_HINT_DEVICE_NIC_MEM) {
SSHMEM_WARN("Hint was not recognized therefore ignored, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we don't need to print any error in this case

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

1f8f040: SSHMEM/UCX: fixed logging

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

MCA_COMMON_UCX_VERBOSE(1, "%s failed: %d, %s", (_msg) ? (_msg) : __func__, \
UCS_PTR_STATUS(_request), \
ucs_status_string(UCS_PTR_STATUS(_request))); \
MCA_COMMON_UCX_VERBOSE(1, "%s failed2: %d, %s", (_msg) ? (_msg) : __func__,\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover

@yosefe
Copy link
Contributor

yosefe commented Aug 29, 2023

@roiedanino next time pls don't force push


status = ucp_mem_map(spml->ucp_context, &mem_map_params, &mem_h);
if (UCS_OK != status) {
SSHMEM_ERROR("ucp_mem_map() failed: %s\n", ucs_status_string(status));
SSHMEM_ERROR("Failed to allocate DEVICE_NIC_MEM"
Copy link
Contributor

@gleon99 gleon99 Sep 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this error msg is indicative.
In this particular context, (in theory) it might not necessarily be only about DEVICE_NIC_MEM...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SSHMEM_ERROR("Failed to allocate DEVICE_NIC_MEM"
SSHMEM_ERROR("ucp_mem_map() failed: %s %s\n",
ucs_status_string(status),
mem_type == UCS_MEMORY_TYPE_RDMA ?
"failed to allocate DEVICE_NIC_MEM, falls back to host memory" : "");

Maybe that way?

oshmem/mca/sshmem/ucx/sshmem_ucx_module.c Show resolved Hide resolved
Comment on lines 191 to 193
} else {
SSHMEM_WARN("Failed to allocate DEVICE_NIC_MEM: %s, "
"fallback to host memory", ucs_status_string(status));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this

oshmem/mca/sshmem/ucx/sshmem_ucx_module.c Show resolved Hide resolved
@@ -123,7 +123,9 @@ segment_create_internal(map_segment_t *ds_buf, void *address, size_t size,

status = ucp_mem_map(spml->ucp_context, &mem_map_params, &mem_h);
if (UCS_OK != status) {
SSHMEM_ERROR("ucp_mem_map() failed: %s\n", ucs_status_string(status));
SSHMEM_VERBOSE(err_level, "ucp_mem_map(%s) failed: %s\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory_type=%s

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls squash

if (status == OSHMEM_SUCCESS) {
ds_buf->alloc_hints = hint;
ds_buf->allocator = &sshmem_ucx_allocator;
return OSHMEM_SUCCESS;
}
#else
SSHMEM_VERBOSE(1, "DEVICE_NIC_MEM hint ignored since UCX does not "
SSHMEM_VERBOSE(20, "DEVICE_NIC_MEM hint ignored since UCX does not "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use level 3

return OSHMEM_ERR_BAD_PARAM;
#if HAVE_DECL_UCS_MEMORY_TYPE_RDMA
status = segment_create_internal(ds_buf, NULL, size, flags,
UCS_MEMORY_TYPE_RDMA, 20);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use level 3 instead of 20

type

Signed-off-by: Roie Danino <[email protected]>

Added a fallback for rdma allocation failure - allocating host memory instead

Signed-off-by: Roie Danino <[email protected]>
Copy link
Contributor

@gleon99 gleon99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe can we merge this one?

@yosefe yosefe merged commit 3de90b1 into open-mpi:main Oct 8, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants