oshmem/shmem: Allocate and exchange base segment address beforehand #12889

Open
wants to merge 6 commits into
base: main

tvegas1
Contributor

@tvegas1 tvegas1 commented Oct 28, 2024

What

The _end symbol of each process depends on how the program was built. Try negotiation first, on the assumption that a symmetric layout leaves the same memory areas available on all ranks. If not all ranks can create the segment at the same position, fall back on the current hardcoded method.

We need to keep the mmap() as a reservation in all cases, so that intermediate library calls do not consume the range in the meantime. If that happens, the UCX module overrides it, causing corruption later.
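
A minimal sketch of the reserve-then-overwrite idea described above, not the PR's actual code. The helper names reserve_range() and commit_range() are hypothetical, and the PROT_NONE / MAP_NORESERVE flags are an assumption about how such a reservation could be held; the PR's memheap_mmap_get() may use different flags.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: hold `size` bytes of address space without
 * making them accessible. `hint` may be NULL to let the kernel pick.
 * Returns MAP_FAILED on error. */
static void *reserve_range(void *hint, size_t size)
{
    assert(size > 0);
    return mmap(hint, size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Hypothetical helper: replace the reservation in place with usable
 * memory. MAP_FIXED replaces whatever is mapped at `addr`, so the
 * range is never released in between. */
static void *commit_range(void *addr, size_t size)
{
    assert(addr != NULL && size > 0);
    return mmap(addr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```

Because the range stays mapped between the two calls, an unrelated mmap() issued by a library in between cannot be placed inside it.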

Tested

  1. -mca sshmem_base_start_address 0xffffffffffffffff or no option: negotiation takes place, mmap reservation
  2. -mca sshmem_base_start_address 0x7f.....: no negotiation, mmap reservation, failure to allocate is detected
  3. when one or more ranks fail to negotiate, all of them fall back on the hardcoded method with mmap reservation

Static segment creation always skips the module-created segment. Segments found in /proc/self/maps are always at least as large as the module-allocated one.

Misc

Configure: ./configure --prefix=rfs --enable-debug --with-ucx=rfs
Options: -mca memheap_base_verbose 100, -mca sshmem sysv/mmap/ucx

#endif
}

if (mca_sshmem_base_start_address != memheap_mmap_get(
Contributor Author

This is based on mmap() behavior, where it always creates the vma at the hint position if possible. If this is not always true (kernel versions...), it could regress existing behavior and even fail to honor the command line parameter.

Shall we remove that confirmation check and proceed regardless? Or maybe ignore the check only when the address was passed on the command line?
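
For illustration, a sketch of the confirmation check being discussed, under the assumption that the hint is passed without MAP_FIXED. The helper name map_at_hint_or_fail() is hypothetical and is not part of the PR.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes at `hint` and keep the mapping
 * only if the kernel placed it exactly there. With a NULL hint any
 * address is accepted. Returns NULL on failure or on a mismatch. */
static void *map_at_hint_or_fail(void *hint, size_t size)
{
    void *ptr;

    assert(size > 0);
    ptr = mmap(hint, size, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
        return NULL;
    }
    if (hint != NULL && ptr != hint) {
        /* The hint was not honored: undo and report the mismatch. */
        (void)munmap(ptr, size);
        return NULL;
    }
    return ptr;
}
```

Without MAP_FIXED the hint is only advisory, so the comparison is the one place where a differing placement becomes visible to the caller.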

@@ -126,6 +126,7 @@ segment_create(map_segment_t *ds_buf,
/* init the contents of map_segment_t */
shmem_ds_reset(ds_buf);

(void)munmap(mca_sshmem_base_start_address, size);
Contributor Author

probably not needed

Member

why added then?

Contributor Author

We now "reserve" that area by holding an mmap() on it, as there seems to be no randomization across an mmap/munmap + mmap sequence, and the area could be consumed by an unrelated mmap() in between.

Then in the modules we "overwrite" it (with ucp_mem_map() / mmap() / shmat()). The munmap() is an attempt to make that explicit, although it opens a race, and mmap() replaces the mapping with MAP_FIXED anyway.

Will remove; need to check that shmat() overwrites an existing area too.

Contributor Author

removed for mmap module, kept for sysv module as it is needed

@tvegas1
Contributor Author

tvegas1 commented Oct 28, 2024

@brminich

Member

@brminich brminich left a comment

seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

oshmem/mca/sshmem/sysv/sshmem_sysv_module.c (resolved)
oshmem/mca/memheap/base/memheap_base_static.c (resolved)
Comment on lines 157 to 162
rc = oshmem_shmem_allgather(&ptr, bases, sizeof(ptr));
if (OSHMEM_SUCCESS != rc) {
MEMHEAP_ERROR("Failed to exchange selected vma for base segment "
"(error %d)", rc);
goto out;
}
Member

maybe we can also introduce an option without fallback to the original behavior? Then allgatherv will not be needed.

Contributor Author

OK, in that case we could depend on the mca_sshmem_base_start_address value:
1- if 0: bcast the pointer value; any rank unable to create fails on its side, global failure
2- if UINTPTR_MAX: bcast the pointer value, then allgather so that they all fall back on the default value

The default could be point 2-.
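
The two modes above can be sketched as a pure decision function, with no communication involved. This is only a model of the proposal: the enum, the function name decide_base() and the mapped_ok array (standing in for the allgather result) are hypothetical, and in the real mode 0 each rank would fail locally rather than see a global view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum base_outcome {
    BASE_NEGOTIATED, /* every rank mapped at the broadcast address */
    BASE_FALLBACK,   /* all ranks switch to the default address */
    BASE_FAILURE     /* a rank failed and no fallback is allowed */
};

/* `requested` models mca_sshmem_base_start_address (0 or UINTPTR_MAX).
 * mapped_ok[i] is nonzero if rank i could map at the broadcast base. */
static enum base_outcome decide_base(uintptr_t requested,
                                     const int *mapped_ok, size_t nranks)
{
    size_t i;

    assert(requested == 0 || requested == UINTPTR_MAX);
    for (i = 0; i < nranks; i++) {
        if (!mapped_ok[i]) {
            /* Mode 1 (0): no fallback. Mode 2 (UINTPTR_MAX): fall back. */
            return (requested == 0) ? BASE_FAILURE : BASE_FALLBACK;
        }
    }
    return BASE_NEGOTIATED;
}
```

Mode 1 needs only the bcast, since no rank has to learn about the others' results; mode 2 needs the extra allgather so that every rank reaches the same fallback decision.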

Contributor Author

fixed

base = ptr;
}

rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
Contributor Author

@brminich, tried the patch below, where they all do the mmap(). The address returned by mmap() is randomized, as shown below, so we need some form of synchronization of the base address.

memheap_exchange_base_address() #1: exchange base address: base 0x7fa7d9dff000: ok
memheap_exchange_base_address() #3: exchange base address: base 0x7fdc5a15b000: ok
memheap_exchange_base_address() #2: exchange base address: base 0x7fe8aa56a000: ok
memheap_exchange_base_address() #0: exchange base address: base 0x7f3d1736b000: ok
diff --git a/oshmem/mca/memheap/base/memheap_base_select.c b/oshmem/mca/memheap/base/memheap_base_select.c
index 0ec74de6aa..0b0cfe4bee 100644
--- a/oshmem/mca/memheap/base/memheap_base_select.c
+++ b/oshmem/mca/memheap/base/memheap_base_select.c
@@ -134,21 +134,8 @@ static int memheap_exchange_base_address(size_t size, void **address)
         return OSHMEM_ERROR;
     }

-    if (oshmem_my_proc_id() == 0) {
-        ptr = memheap_mmap_get(NULL, size);
-        base = ptr;
-    }
-
-    rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
-    if (OSHMEM_SUCCESS != rc) {
-        MEMHEAP_ERROR("Failed to exchange allocated vma for base segment "
-                      "(error %d)", rc);
-        goto out;
-    }
-
-    if (oshmem_my_proc_id() != 0) {
-        ptr = memheap_mmap_get(base, size);
-    }
+    ptr = memheap_mmap_get(NULL, size);
+    base = ptr;

     MEMHEAP_VERBOSE(100, "#%d: exchange base address: base %p: %s",
                     oshmem_my_proc_id(), base,

@tvegas1
Contributor Author

tvegas1 commented Nov 5, 2024

> seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

I do not understand that comment, since the new default address is ~0 and rank 0 allocates and bcasts the pointer value, but ack, it is not a full negotiation.

Comment on lines +174 to +177
} else if (ptr != base) {
/* Any failure terminates the rank and others start teardown */
rc = OSHMEM_ERROR;
}
Member

maybe use this as default flow (i mean setting mca_sshmem_base_start_address = NULL by default)

Contributor Author

fixed

@brminich
Member

@yosefe
