Replies: 9 comments 2 replies
-
How are you saving the data in .txt files? If you write the file unconditionally, all ranks will be executing the serialization code, so you get one file per rank. Guard the write with the MPI rank:

```python
import netket as nk

# run your computation ...

if nk.utils.mpi.rank == 0:
    with open("my file.txt", "w") as f:
        f.write(data)
```

Did you try using NetKet's loggers?
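For reference, a minimal sketch of the logger route (assuming NetKet 3.x; the `"estimators"` prefix and the commented driver call are illustrative):

```python
import netket as nk

# NetKet's loggers are MPI-aware: only the master rank writes,
# so a single output file is produced
log = nk.logging.JsonLog("estimators")
# gs.run(n_iter=300, out=log)   # `gs` would be your VMC driver
```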
-
However... what I see there is that you have multiple MPI processes running correctly (n_nodes is what is reported by `check_mpi`). By the way, I'd suggest setting at most 2 CPUs per task for optimal performance, not more.
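If it helps to double-check what each process sees, here is a quick snippet (a sketch printing the same information `check_mpi` reports):

```python
import netket as nk

# each MPI process prints its own rank and the total number of processes
print(f"rank {nk.utils.mpi.rank} of {nk.utils.mpi.n_nodes}")
```

Running it with `mpirun -np 2 python script.py` should print two distinct ranks.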
-
Thanks for your help! Indeed, by adding your flag I no longer get multiple .txt files, so hopefully MPI is working. However, I am not seeing a significant increase in computation speed. As a check, I tried the Jastrow part of your tutorial code https://www.netket.org/tutorials/Heisenberg1d.html (which you apparently ran in 21.09753 seconds). With the same parameters, setting --ntasks=3 and --cpus-per-task=2, I get 48 seconds. That seems quite slow, doesn't it?

In reply to your questions: I actually set ntasks=2, not 3 as I wrote (my mistake). However, I realized that sometimes...
-
That is normal. You have to tune the parameters a bit... Read this in particular. If you do this under MPI with 2 nodes, the chains will be split between the ranks. You do speed up the other computations (the gradient, SR if you use it), but sampling (which is a bottleneck in NetKet) will be slower.

I'm rather confident that the number of available CPUs is what JAX ends up using. Can you tell me in which cases the number of available CPUs reported differs from what you set with mpirun? And what is the structure of your cluster (how many CPUs per node...)? Also, you can try to use...
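To make the trade-off concrete, a sketch of where the relevant knobs live (assuming a recent NetKet 3.x; the values are illustrative, not a recommendation):

```python
import netket as nk

hi = nk.hilbert.Spin(s=1/2, N=16)

# the 16 chains are split across MPI ranks (8 per rank with 2 ranks);
# fewer samples per chain means relatively more time spent on burn-in
sa = nk.sampler.MetropolisLocal(hi, n_chains=16)
vs = nk.vqs.MCState(sa, nk.models.RBM(alpha=1), n_samples=1000)
```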
-
I don't clearly understand your calculation. When you...

Ok, I will check this... Just another question: if I want to run my optimization with different values of the Hamiltonian parameters, what do I need to do? P.S. Thank you again!
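Not from the thread, but a minimal sketch of one way such a sweep could look (assuming NetKet 3.x; the graph, model, and coupling values are illustrative):

```python
import netket as nk

g = nk.graph.Chain(length=16)
hi = nk.hilbert.Spin(s=1/2, N=g.n_nodes)

for J in [0.5, 1.0, 2.0]:
    # rebuild the Hamiltonian and rerun the optimization for each coupling
    ha = nk.operator.Heisenberg(hilbert=hi, graph=g, J=J)
    sa = nk.sampler.MetropolisExchange(hi, graph=g)
    vs = nk.vqs.MCState(sa, nk.models.RBM(alpha=1), n_samples=1008)
    opt = nk.optimizer.Sgd(learning_rate=0.05)
    gs = nk.driver.VMC(ha, opt, variational_state=vs)
    gs.run(n_iter=300, out=nk.logging.JsonLog(f"heisenberg_J{J}"))
```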
-
Yes, but then the cost is 225 vs. 163, which is only about a 28% improvement.

Anything is good.
-
I am also having some trouble when running with MPI. I am trying to get the 1D Heisenberg example to run with `-np > 2`, which always crashes with some LLVM error.

Submitting a script with the NetKet testing tool:

will give me:

This is weird, because the nodes should have 40 CPUs. When I do `mpirun -np 3` (or more) `python3 -m netket.tools.check_mpi`, the output correctly says...

I am also confused by `n_nodes: 2`, as `-np` should specify the number of processes I run in parallel, right? I am running NetKet version 3.0.4. Any help is appreciated.
-
Many thanks for your reply. However, by including `export OMP_NUM_THREADS=1` in the submit script, I can somehow increase `-np` up to 5 before I face the LLVM error once again. If I understood correctly, this environment variable limits the number of threads per process, right? So I am hitting something like a 5-thread limit (possibly per job), which does not seem very reasonable.

3) Virtual memory limit: the output of `ulimit -a | grep virtual` on the command line (and also within the submit script) was "unlimited", so this is probably fine.

Is there something else I could try?
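A possible variant to try (my suggestion, not from the thread): the same limit can also be set from inside the Python script, as long as it happens before JAX/NetKet are imported:

```python
import os

# must be set before importing jax/netket, otherwise the OpenMP/XLA
# runtimes may already have picked their thread count
os.environ["OMP_NUM_THREADS"] = "1"

import netket as nk  # noqa: E402
```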
-
How is NetKet actually running multiple chains per task? I am referring to the `n_chains_per_task` option. Is that something based on MPI, or is it due to JAX/XLA? I mean, NetKet 2 already had this `n_chains` feature without JAX. Thanks!
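As far as I understand (my reading of the docs, not NetKet source): within one MPI task the chains are stored as a single batch and advanced by one vectorized, jitted JAX computation, so no extra processes or threads are needed for them; MPI only enters to split chains across tasks. A toy illustration of the batched-update idea:

```python
import jax
import jax.numpy as jnp

def metropolis_step(key, states):
    # toy single-spin-flip update applied to all chains at once;
    # a real sampler would also compute an acceptance probability
    n_chains, n_sites = states.shape
    sites = jax.random.randint(key, (n_chains,), 0, n_sites)
    return states.at[jnp.arange(n_chains), sites].multiply(-1)

states = jnp.ones((8, 16))   # 8 chains of 16 spins, one array
states = jax.jit(metropolis_step)(jax.random.PRNGKey(0), states)
```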
-
Hello! I'm running some simulations on a cluster that uses SLURM as a job scheduling system. My NetKet version is 3.0b1.post9. In my code, I optimize a custom model defined with the Flax Linen API and then write some data (values of the energy, etc.) to a .txt file. To speed up the computation, I'm trying to use MPI parallelization. For this purpose, I set ntasks>1 in my job script and run with `mpirun python my_programm.py`. I realized that every time I run the code, the program produces several .txt files instead of one (as expected), so I presume MPI is not working properly. However, if I run `mpirun -np 2 python3 -m netket.tools.check_mpi`, as recommended at https://www.netket.org/docs/sharp-bits.html#parallelization, I get a reasonable result, for example (setting ntasks=3 and cpus-per-task=10):

As you can see, my OpenMPI version is 3.1.4 (since no more recent version is available on the cluster). Do you have any idea of what is happening?