ompi/datatype: use size_t for count arguments #12351

wenduwan · 2024-02-19T22:33:12Z

Using int for count arguments is becoming restrictive, especially for adopting large-count support. This change widens the argument type to size_t.

bosilca · 2024-02-20T03:13:18Z

This change in API will break the MPI datatype management in many ways (basically everything that relies on the datatype storage description, such as one-sided, IO, and all the datatype representation manipulation functions, the combiner_*). The root cause is that the internal storage format for the datatype, and all the functions that expose it to other libraries, are based on int32_t.

I have the feeling that the correct solution is to split the datatype API in two: one used by MPI, where we follow MPI expectations, and an internal one, where we can use large counts but will not provide any data representation support (that support will remain at the MPI level). Based on this, we will only use the internal API inside OMPI, and the external API will be reserved for the MPI layer.
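As a rough illustration of the proposed split, the sketch below pairs an int-based MPI-facing wrapper with a size_t-based internal function. Both function names and the error convention (0 on success, -1 on failure) are hypothetical and do not exist in Open MPI; the sketch only shows the layering, not the datatype engine itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical internal entry point: takes a size_t count and keeps no
 * MPI-visible description of how the type was built. */
static int internal_create_contiguous(size_t count, size_t elem_size, size_t *total_size)
{
    assert(total_size != NULL);
    if (elem_size != 0 && count > SIZE_MAX / elem_size) {
        return -1; /* total size would overflow size_t */
    }
    *total_size = count * elem_size;
    return 0;
}

/* Hypothetical MPI-facing wrapper: keeps the int count that the MPI binding
 * expects, validates it, then forwards to the internal API. */
static int mpi_create_contiguous(int count, size_t elem_size, size_t *total_size)
{
    if (count < 0) {
        return -1; /* MPI counts must be non-negative */
    }
    return internal_create_contiguous((size_t) count, elem_size, total_size);
}
```

In this arrangement only the wrapper would need to record the int-based arguments for the MPI-level representation, while internal callers could pass counts beyond INT_MAX.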

wenduwan · 2024-02-20T15:22:35Z

I also had the concern of breaking API compatibility, but none of our CI (internal and GitHub Actions) actually caught this, which is interesting.

I was reading the code and saw that the count argument is sometimes typed size_t like here - so I have a hunch that it should work somehow and it is possible to change the type safely.

I also agree with the proposed solution to introduce another internal API - I haven't yet figured out how.

jsquyres · 2024-03-05T16:15:05Z

@wenduwan It looks like you are still working on this. Do you want to move this PR to Draft?

bosilca · 2024-07-09T21:08:50Z

ompi/datatype/ompi_datatype.h

@@ -150,7 +150,7 @@ ompi_datatype_is_predefined( const ompi_datatype_t* type )
 }

 static inline int32_t
-ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, int32_t count )
+ompi_datatype_is_contiguous_memory_layout( const ompi_datatype_t* type, size_t count )


This makes no sense without first modifying opal_datatype_is_contiguous_memory_layout.

bosilca · 2024-07-09T21:22:33Z

ompi/datatype/ompi_datatype.h

@@ -188,20 +188,20 @@ ompi_datatype_add( ompi_datatype_t* pdtBase, const ompi_datatype_t* pdtAdd, size
 OMPI_DECLSPEC int32_t
 ompi_datatype_duplicate( const ompi_datatype_t* oldType, ompi_datatype_t** newType );

-OMPI_DECLSPEC int32_t ompi_datatype_create_contiguous( int count, const ompi_datatype_t* oldType, ompi_datatype_t** newType );


These are more complicated, as we would need to change the ddt_elem_desc and ddt_loop_desc structs to store count as size_t. Unfortunately, that will change the size of these structs and increase the overall size of the datatype representation.
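To see why widening one field can grow every descriptor entry, consider the minimal sketch below. The two structs are hypothetical stand-ins, not the real ddt_elem_desc layout; only the effect of padding and alignment is being illustrated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element descriptor with a 32-bit count. */
struct elem_desc_narrow {
    uint16_t  flags;
    uint16_t  type;
    uint32_t  count;
    uint32_t  blocklen;
    ptrdiff_t extent;
    ptrdiff_t disp;
};

/* Same fields, but with the count widened to size_t. */
struct elem_desc_wide {
    uint16_t  flags;
    uint16_t  type;
    size_t    count;
    uint32_t  blocklen;
    ptrdiff_t extent;
    ptrdiff_t disp;
};

/* Extra bytes needed per descriptor entry after widening. */
static size_t widening_cost_per_entry(void)
{
    assert(sizeof(struct elem_desc_wide) >= sizeof(struct elem_desc_narrow));
    return sizeof(struct elem_desc_wide) - sizeof(struct elem_desc_narrow);
}
```

On a typical LP64 ABI these hypothetical structs occupy 32 and 40 bytes respectively, so the cost is paid once per entry in the description array.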

bosilca · 2024-07-09T21:25:04Z

ompi/datatype/ompi_datatype_create_indexed.c

@@ -31,13 +31,12 @@


 /* We try to merge together data that are contiguous */
-int32_t ompi_datatype_create_indexed( int count, const int* pBlockLength, const int* pDisp,


Making the count in indexed datatypes a size_t makes very little sense, because it indicates the number of entries in the displacement and blocklen arrays. In other words, the user-facing datatype representation would be several gigabytes long.

I can understand that for symmetry with the contiguous one would like to have this as a size_t, but my comment above remains valid.

wenduwan · 2024-07-24

@bosilca I need your thoughts here. I figured MPI_Type_indexed also has the large-count variant MPI_Type_indexed_c, which I imagine will also need size_t support in ompi_datatype_create_indexed and related functions. Did I miss anything?

Same goes for *struct and *vector.

devreal · 2024-07-24

@wenduwan Check out the work that @hppritcha and Jakob did on the *w collectives. There are now disp and count arrays that store whether the source is 32-bit or 64-bit and consistently hand out 64-bit values: https://github.com/open-mpi/ompi/blob/main/ompi/util/count_disp_array.h
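The idea of an array that remembers its source width can be sketched as follows. This is a hypothetical illustration with made-up type and function names; it is not the API declared in count_disp_array.h. The view borrows the caller's array rather than copying it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view over a caller-owned count array. It records the
 * element width and does not copy the data. */
typedef struct {
    bool        is_64bit;
    const void *data;
} count_view_t;

static count_view_t count_view_from_int32(const int32_t *counts)
{
    count_view_t v = { false, counts };
    return v;
}

static count_view_t count_view_from_int64(const int64_t *counts)
{
    count_view_t v = { true, counts };
    return v;
}

/* Always hands out a 64-bit value, whatever the source width. */
static int64_t count_view_get(count_view_t v, size_t i)
{
    assert(v.data != NULL);
    return v.is_64bit ? ((const int64_t *) v.data)[i]
                      : (int64_t) ((const int32_t *) v.data)[i];
}
```

Because the view only borrows the array, it is valid only while the caller's buffer stays alive.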

bosilca · 2024-07-24

The count is the number of elements in the indexed type, and if that number cannot be represented as an int, it basically means that the datatype description (where we need 64 bytes to represent each contiguous element) will be bigger than the represented data. No normal person should even imagine such an API.

@devreal's suggestion is mostly irrelevant here. In the collective case, the user-provided buffers are guaranteed to be available during the entire collective operation, so piggybacking on the user buffer is possible. There are no such guarantees with datatypes.

wenduwan · 2024-07-24

@bosilca Thanks. Could you elaborate more on the "the datatype description .. will be bigger than the represented data." part? Are you referring to this struct? I noted that opal_datatype_count_t is already size_t.

    struct dt_type_desc_t {
        opal_datatype_count_t length; /**< the maximum number of elements in the description array */
        opal_datatype_count_t used;   /**< the number of used elements in the description array */
        dt_elem_desc_t *desc;
    };

bosilca · 2024-07-25

Each entry in the indexed/struct will be represented by a ddt_elem_desc struct, which is 32 bytes. So, if you have a number of entries that will not fit into an int, it means that the data description itself will be over 64GB, and there will be little memory left for the data itself (especially since, if you use an indexed or struct type, you have gaps between these elements).
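The arithmetic behind the 64GB figure can be checked with a short sketch. It takes the 32-byte descriptor size quoted in this comment at face value and assumes a 32-bit int; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store `entries` descriptors of `entry_bytes` each. */
static uint64_t description_bytes(uint64_t entries, uint64_t entry_bytes)
{
    /* Guard against overflow of the 64-bit product. */
    assert(entry_bytes == 0 || entries <= UINT64_MAX / entry_bytes);
    return entries * entry_bytes;
}

/* Smallest entry count that no longer fits in a 32-bit signed int. */
static uint64_t first_count_beyond_int32(void)
{
    return (uint64_t) INT32_MAX + 1u; /* 2^31 */
}
```

With 2^31 entries of 32 bytes each, the description alone takes 2^36 bytes, that is, 64 GiB.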

wenduwan · 2024-07-25

@bosilca Please correct me if I'm wrong. The fundamental concern here is the datatype descriptor struct size.

The root of this issue is in ompi_datatype_add and therefore opal_datatype_add, which allocates new memory here:

ompi/opal/datatype/opal_datatype_add.c

Lines 297 to 304 in ce2310a

    pdtBase->bdt_used |= pdtAdd->bdt_used;
    newLength = pdtBase->desc.used + place_needed;
    if (newLength > pdtBase->desc.length) {
        newLength = ((newLength / DT_INCREASE_STACK) + 1) * DT_INCREASE_STACK;
        pdtBase->desc.desc = (dt_elem_desc_t *) realloc(pdtBase->desc.desc,
                                                        sizeof(dt_elem_desc_t) * newLength);
        pdtBase->desc.length = newLength;
    }

I am likely missing important context here. Why would it be a problem for indexed/struct but not vector, which also calls into ompi_datatype_add? 🤔

Also, suppose the user has a huge amount of memory to waste, would this concern still hold?
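The growth step in the quoted opal_datatype_add snippet rounds the required length up to the next multiple of DT_INCREASE_STACK and reallocates. A minimal standalone sketch of that pattern follows, with overflow and allocation-failure checks added for illustration. GROW_STEP and its value are hypothetical stand-ins for DT_INCREASE_STACK, and the helper names are not part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define GROW_STEP 8 /* hypothetical stand-in for DT_INCREASE_STACK */

/* Same rounding as the quoted snippet: always lands on a multiple of
 * GROW_STEP strictly greater than `needed`. */
static size_t round_up_capacity(size_t needed)
{
    return ((needed / GROW_STEP) + 1) * GROW_STEP;
}

/* Grow *buf to hold at least `needed` elements. Returns 0 on success and
 * -1 on overflow or allocation failure; *buf is left untouched on failure. */
static int grow_array(void **buf, size_t *capacity, size_t needed, size_t elem_size)
{
    assert(buf != NULL && capacity != NULL && elem_size != 0);
    if (needed <= *capacity) {
        return 0; /* already large enough */
    }
    if (needed > SIZE_MAX - GROW_STEP) {
        return -1; /* rounding up would overflow */
    }
    size_t new_cap = round_up_capacity(needed);
    if (new_cap > SIZE_MAX / elem_size) {
        return -1; /* byte size would overflow */
    }
    void *tmp = realloc(*buf, new_cap * elem_size);
    if (tmp == NULL) {
        return -1; /* the old buffer is still valid */
    }
    *buf = tmp;
    *capacity = new_cap;
    return 0;
}
```

Note that this rounding adds a full step even when the required length is already a multiple of the step, so a request for 16 entries yields a capacity of 24.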

bosilca · 2024-07-25

The vector representation is compact because it's regular and repetitive: with a single ddt_elem_desc (i.e., 64 bytes) one can describe a datatype that covers the entire memory. Indexed and struct representations have a number of ddt_elem_desc entries (I'm not talking about the datatype count here), and when the number of entries is larger than what fits in an int (which is what we are talking about here), the datatype representation itself will be comparable in size to the memory layout it covers.

wenduwan · 2024-07-25

Ah, I see. ompi_datatype_create_vector does not call ompi_datatype_add in a loop, but indexed/struct do.

But still, without embiggening the indexed/struct count (as well as block length, displ), how can we support the MPI_Type_{indexed,struct}_c APIs?

bosilca · 2024-07-25

My comment was mostly about how stupid that API is, not about how we support it. And I'm not even talking about the performance of parsing that extremely large description to pack/unpack data.

But if you ask me how to do it, I would check that that number is below INT_MAX, and return some error otherwise (not enough memory or something). If the number is reasonable, we keep doing as today.
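A minimal sketch of that check follows. The function name and the return codes are hypothetical placeholders, not Open MPI error constants; a real implementation would pick whichever existing error code fits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum { COUNT_OK = 0, COUNT_TOO_LARGE = -1 };

/* Narrow a large count to int, refusing values above INT_MAX.
 * *out is written only on success. */
static int count_to_int(size_t count, int *out)
{
    assert(out != NULL);
    if (count > (size_t) INT_MAX) {
        return COUNT_TOO_LARGE;
    }
    *out = (int) count;
    return COUNT_OK;
}
```

A large-count entry point could call such a helper first and, on success, continue through the existing int-based path unchanged.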

bosilca · 2024-07-09T21:26:14Z

ompi/datatype/ompi_datatype_create_struct.c

@@ -28,13 +28,12 @@

 #include "ompi/datatype/ompi_datatype.h"

-int32_t ompi_datatype_create_struct( int count, const int* pBlockLength, const ptrdiff_t* pDisp,


Same comment as for indexed types.

bosilca · 2024-07-09T21:26:38Z

ompi/datatype/ompi_datatype_create_vector.c

@@ -28,7 +28,7 @@

 #include "ompi/datatype/ompi_datatype.h"

-int32_t ompi_datatype_create_vector( int count, int bLength, int stride,


This one would be acceptable I guess.

wenduwan · 2024-07-15T21:34:36Z

@bosilca Thanks for your comments. I'm looking into this PR again.

This patch prepares the opal datatype engine for large count support. Related function arguments need to accept size_t input, and accordingly we had to modify code where those functions are called with smaller integer types.

Signed-off-by: Wenduo Wang <[email protected]>

wenduwan · 2024-07-26T14:36:32Z

opal/datatype/opal_datatype_create.c

 {
    opal_datatype_t *datatype = (opal_datatype_t *) OBJ_NEW(opal_datatype_t);

-    if (expectedSize == -1) {
+    if (expectedSize == (size_t) -1) {


I'm very unsure about this - what is the scenario where size will be -1?
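For context on the cast in this hunk: in C, converting -1 to size_t yields SIZE_MAX, so a caller that still passes -1 to a size_t parameter would continue to match the comparison. The sketch below shows only that language rule; the function name, the default value, and the meaning given to the sentinel ("use a default size") are assumptions, not taken from the Open MPI sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_SIZE 8 /* hypothetical default, for illustration only */

/* Converting -1 to an unsigned type wraps to that type's maximum value. */
static_assert((size_t) -1 == SIZE_MAX, "(size_t) -1 wraps to SIZE_MAX");

/* Treat (size_t) -1 as a "use the default" sentinel. */
static size_t resolve_expected_size(size_t expected)
{
    if (expected == (size_t) -1) {
        return DEFAULT_SIZE;
    }
    return expected;
}
```

One consequence is that SIZE_MAX itself can no longer be passed as a genuine size, which is rarely a practical limitation.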

wenduwan · 2024-07-26T14:53:36Z

After a fresh look at the change, it appears to have a much larger blast radius than I expected.

The original revision might have started in the wrong place: it was changing the ompi datatype APIs, but that requires adapting the internal opal datatypes. With that in mind, I think it is better to start with the opal datatypes instead. Updated the PR accordingly.

Still looking at other opal functions to find out what else needs to change.

wenduwan · 2024-09-10T20:27:49Z

@hppritcha @bosilca I'm sorry that I won't have time to work on this. Unfortunately I have to leave this to someone else. Closing the PR.

wenduwan requested review from bosilca and hppritcha February 19, 2024 22:33

wenduwan self-assigned this Feb 19, 2024

github-actions bot added the Target: main label Feb 19, 2024

wenduwan marked this pull request as draft March 5, 2024 16:16

jtronge mentioned this pull request Jun 17, 2024

Update collective framework count/disp arrays for bigcount #12621

Merged

bosilca reviewed Jul 9, 2024


wenduwan force-pushed the datatype_use_size_t branch from ae65011 to c79bc2d Compare July 26, 2024 14:34

wenduwan commented Jul 26, 2024


wenduwan closed this Sep 10, 2024

wenduwan deleted the datatype_use_size_t branch September 10, 2024 20:27
