Building version v5.0.6 with CUDA fails #12924

lahwaacz · 2024-11-16T07:21:00Z

Building OpenMPI version v5.0.6 with CUDA fails with the following error:

coll_cuda_module.c: In function ‘mca_coll_cuda_comm_query’:
coll_cuda_module.c:107:42: error: assignment to ‘mca_coll_base_module_reduce_local_fn_t’ {aka ‘int (*)(const void *, void *, int,  struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} from incompatible pointer type ‘int (*)(const void *, void *, size_t,  struct ompi_datatype_t *, struct ompi_op_t *, mca_coll_base_module_t *)’ {aka ‘int (*)(const void *, void *, long unsigned int,  struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} [-Wincompatible-pointer-types]
  107 |     cuda_module->super.coll_reduce_local = mca_coll_cuda_reduce_local;
      |                                          ^

The text was updated successfully, but these errors were encountered:

lahwaacz · 2024-11-16T07:31:07Z

Seems to be caused by e3ad86e

It specifies the function with size_t count parameter, which does not work because 4d7c65d was done only on the main branch, not on v5.0.x.

bosilca · 2024-11-18T17:16:27Z

SC24 travel day, can't do more than this:

diff --git a/ompi/mca/coll/cuda/coll_cuda.h b/ompi/mca/coll/cuda/coll_cuda.h
index afedc632ee..4b3ecc647e 100644
--- a/ompi/mca/coll/cuda/coll_cuda.h
+++ b/ompi/mca/coll/cuda/coll_cuda.h
@@ -54,7 +54,7 @@ int mca_coll_cuda_reduce(const void *sbuf, void *rbuf, int count,
                          struct ompi_communicator_t *comm,
                          mca_coll_base_module_t *module);
 
-int mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, size_t count,
+int mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, int count,
                                struct ompi_datatype_t *dtype,
                                struct ompi_op_t *op,
                                mca_coll_base_module_t *module);
diff --git a/ompi/mca/coll/cuda/coll_cuda_reduce.c b/ompi/mca/coll/cuda/coll_cuda_reduce.c
index 7743a07874..e165a1d9bb 100644
--- a/ompi/mca/coll/cuda/coll_cuda_reduce.c
+++ b/ompi/mca/coll/cuda/coll_cuda_reduce.c
@@ -85,7 +85,7 @@ mca_coll_cuda_reduce(const void *sbuf, void *rbuf, int count,
 }
 
 int
-mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, size_t count,
+mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, int count,
                            struct ompi_datatype_t *dtype,
                            struct ompi_op_t *op,
                            mca_coll_base_module_t *module)

edgargabriel · 2024-11-18T20:37:59Z

did we not backport the cuda compilation check to the 5.0.x branch? That should/would have caught the issue

lahwaacz · 2024-11-24T10:00:18Z

@bosilca @janjust Any progress? This is blocking us from upgrading to the latest release.

janjust · 2024-11-25T15:08:55Z

@lahwaacz back from SC - does the above patch work for you? For now we have to consider v5.0.6 broken for cuda builds, I'll push an rc1 asap. with a release week after "hopefully"

lahwaacz · 2024-11-25T17:51:46Z

@janjust The patch makes the build pass, but I have no idea if the patched release actually works - obviously nobody tested it yet with cuda...

janjust · 2024-11-25T17:54:14Z

So that's the thing, I was about to ask on slack/general

But for some odd reason, the v5.0.6 builds coll/cuda just fine for me.

$ ./ompi_info -a | grep cuda
30:  Configure command line: '--prefix=/hpc/mtr_scrap/users/tomislavj/ompi-dev/openmpi-5.0.6/install' '--with-cuda=/hpc/local/oss/cuda12.6.1/redhat8' '--with-ucx=/hpc/mtr_scrap/users/tomislavj/ompi-dev/ucx-build/install' '--with-libevent=internal' '--enable-prte-prefix-by-default' '--enable-mpi1-compatibility' '--without-xpmem' '--with-pmix=internal' '--with-prrte=internal' '--enable-mca-dso=all' '--disable-oshmem' '--enable-mpi-fortran=no'
125:          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
136:         MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.6)
142:                 MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.6)
172:                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.6)

$ ./configure --prefix=/hpc/mtr_scrap/users/tomislavj/ompi-dev/openmpi-5.0.6/install --with-cuda=/hpc/local      /oss/cuda12.6.1/redhat8 --with-ucx=/hpc/mtr_scrap/users/tomislavj/ompi-dev/ucx-build/install --with-libevent=      internal --enable-prte-prefix-by-default --enable-mpi1-compatibility --without-xpmem --with-pmix=internal --w      ith-prrte=internal --enable-mca-dso=all --disable-oshmem --enable-mpi-fortran=no

And this is using the source tarball

janjust · 2024-11-25T18:11:27Z

@lahwaacz what compiler are you using, just curious
And what is your configure command?

janjust · 2024-11-25T18:22:30Z

nvm, I think I got it, @bosilca pointed out it's due to the -Werror flag missing

janjust · 2024-11-25T20:10:32Z

fixed with #12934

lahwaacz · 2024-11-26T06:09:55Z

@janjust Thanks! Will you make a new release with the fix?

janjust · 2024-11-26T18:06:45Z

I'm hoping to cut an RC today or tomorrow; however, we're also investigating an osc issue which is currently blocking release. I'm afraid I don't have an eta.

jsquyres assigned bosilca and janjust Nov 17, 2024

jsquyres added question Target: v5.0.x labels Nov 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building version v5.0.6 with CUDA fails #12924

Building version v5.0.6 with CUDA fails #12924

lahwaacz commented Nov 16, 2024

lahwaacz commented Nov 16, 2024

bosilca commented Nov 18, 2024

edgargabriel commented Nov 18, 2024

lahwaacz commented Nov 24, 2024

janjust commented Nov 25, 2024

lahwaacz commented Nov 25, 2024

janjust commented Nov 25, 2024 •

edited

Loading

janjust commented Nov 25, 2024

janjust commented Nov 25, 2024

janjust commented Nov 25, 2024

lahwaacz commented Nov 26, 2024

janjust commented Nov 26, 2024 •

edited

Loading

Building version v5.0.6 with CUDA fails #12924

Building version v5.0.6 with CUDA fails #12924

Comments

lahwaacz commented Nov 16, 2024

lahwaacz commented Nov 16, 2024

bosilca commented Nov 18, 2024

edgargabriel commented Nov 18, 2024

lahwaacz commented Nov 24, 2024

janjust commented Nov 25, 2024

lahwaacz commented Nov 25, 2024

janjust commented Nov 25, 2024 • edited Loading

janjust commented Nov 25, 2024

janjust commented Nov 25, 2024

janjust commented Nov 25, 2024

lahwaacz commented Nov 26, 2024

janjust commented Nov 26, 2024 • edited Loading

janjust commented Nov 25, 2024 •

edited

Loading

janjust commented Nov 26, 2024 •

edited

Loading