Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building version v5.0.6 with CUDA fails #12924

Open
lahwaacz opened this issue Nov 16, 2024 · 12 comments
Open

Building version v5.0.6 with CUDA fails #12924

lahwaacz opened this issue Nov 16, 2024 · 12 comments

Comments

@lahwaacz
Copy link
Contributor

Building OpenMPI version v5.0.6 with CUDA fails with the following error:

coll_cuda_module.c: In function ‘mca_coll_cuda_comm_query’:
coll_cuda_module.c:107:42: error: assignment to ‘mca_coll_base_module_reduce_local_fn_t’ {aka ‘int (*)(const void *, void *, int,  struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} from incompatible pointer type ‘int (*)(const void *, void *, size_t,  struct ompi_datatype_t *, struct ompi_op_t *, mca_coll_base_module_t *)’ {aka ‘int (*)(const void *, void *, long unsigned int,  struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} [-Wincompatible-pointer-types]
  107 |     cuda_module->super.coll_reduce_local = mca_coll_cuda_reduce_local;
      |                                          ^
@lahwaacz
Copy link
Contributor Author

Seems to be caused by e3ad86e

It specifies the function with size_t count parameter, which does not work because 4d7c65d was done only on the main branch, not on v5.0.x.

@bosilca
Copy link
Member

bosilca commented Nov 18, 2024

SC24 travel day, can't do more than this:

diff --git a/ompi/mca/coll/cuda/coll_cuda.h b/ompi/mca/coll/cuda/coll_cuda.h
index afedc632ee..4b3ecc647e 100644
--- a/ompi/mca/coll/cuda/coll_cuda.h
+++ b/ompi/mca/coll/cuda/coll_cuda.h
@@ -54,7 +54,7 @@ int mca_coll_cuda_reduce(const void *sbuf, void *rbuf, int count,
                          struct ompi_communicator_t *comm,
                          mca_coll_base_module_t *module);
 
-int mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, size_t count,
+int mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, int count,
                                struct ompi_datatype_t *dtype,
                                struct ompi_op_t *op,
                                mca_coll_base_module_t *module);
diff --git a/ompi/mca/coll/cuda/coll_cuda_reduce.c b/ompi/mca/coll/cuda/coll_cuda_reduce.c
index 7743a07874..e165a1d9bb 100644
--- a/ompi/mca/coll/cuda/coll_cuda_reduce.c
+++ b/ompi/mca/coll/cuda/coll_cuda_reduce.c
@@ -85,7 +85,7 @@ mca_coll_cuda_reduce(const void *sbuf, void *rbuf, int count,
 }
 
 int
-mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, size_t count,
+mca_coll_cuda_reduce_local(const void *sbuf, void *rbuf, int count,
                            struct ompi_datatype_t *dtype,
                            struct ompi_op_t *op,
                            mca_coll_base_module_t *module)

@edgargabriel
Copy link
Member

did we not backport the cuda compilation check to the 5.0.x branch? That should/would have caught the issue

@lahwaacz
Copy link
Contributor Author

@bosilca @janjust Any progress? This is blocking us from upgrading to the latest release.

@janjust
Copy link
Contributor

janjust commented Nov 25, 2024

@lahwaacz back from SC - does the above patch work for you? For now we have to consider v5.0.6 broken for cuda builds, I'll push an rc1 asap. with a release week after "hopefully"

@lahwaacz
Copy link
Contributor Author

@janjust The patch makes the build pass, but I have no idea if the patched release actually works - obviously nobody tested it yet with cuda...

@janjust
Copy link
Contributor

janjust commented Nov 25, 2024

So that's the thing, I was about to ask on slack/general

But for some odd reason, the v5.0.6 builds coll/cuda just fine for me.

$ ./ompi_info -a | grep cuda
30:  Configure command line: '--prefix=/hpc/mtr_scrap/users/tomislavj/ompi-dev/openmpi-5.0.6/install' '--with-cuda=/hpc/local/oss/cuda12.6.1/redhat8' '--with-ucx=/hpc/mtr_scrap/users/tomislavj/ompi-dev/ucx-build/install' '--with-libevent=internal' '--enable-prte-prefix-by-default' '--enable-mpi1-compatibility' '--without-xpmem' '--with-pmix=internal' '--with-prrte=internal' '--enable-mca-dso=all' '--disable-oshmem' '--enable-mpi-fortran=no'
125:          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
136:         MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.6)
142:                 MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.6)
172:                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.6) 
$ ./configure --prefix=/hpc/mtr_scrap/users/tomislavj/ompi-dev/openmpi-5.0.6/install --with-cuda=/hpc/local      /oss/cuda12.6.1/redhat8 --with-ucx=/hpc/mtr_scrap/users/tomislavj/ompi-dev/ucx-build/install --with-libevent=      internal --enable-prte-prefix-by-default --enable-mpi1-compatibility --without-xpmem --with-pmix=internal --w      ith-prrte=internal --enable-mca-dso=all --disable-oshmem --enable-mpi-fortran=no

And this is using the source tarball

@janjust
Copy link
Contributor

janjust commented Nov 25, 2024

@lahwaacz what compiler are you using, just curious
And what is your configure command?

@janjust
Copy link
Contributor

janjust commented Nov 25, 2024

nvm, I think I got it, @bosilca pointed out it's due to the -Werror flag missing

@janjust
Copy link
Contributor

janjust commented Nov 25, 2024

fixed with #12934

@lahwaacz
Copy link
Contributor Author

@janjust Thanks! Will you make a new release with the fix?

@janjust
Copy link
Contributor

janjust commented Nov 26, 2024

I'm hoping to cut an RC today or tomorrow; however, we're also investigating an osc issue which is currently blocking release. I'm afraid I don't have an eta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants