Replies: 9 comments 2 replies
-
How are you saving the data in .txt files? If you write the file unconditionally, all ranks will be executing the serialization code, so you get one file per rank. Guard the write with the MPI rank:

```python
import netket as nk

# run your computation ...

if nk.utils.mpi.rank == 0:
    with open("my file.txt", "w") as f:
        f.write(data)
```

Did you try using NetKet's loggers?
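For reference, a minimal sketch of the logger route (assuming NetKet 3.x; the `"estimators"` prefix and the commented driver call are illustrative):

```python
import netket as nk

# NetKet's loggers are MPI-aware: only the master rank writes,
# so a single output file is produced
log = nk.logging.JsonLog("estimators")
# gs.run(n_iter=300, out=log)   # `gs` would be your VMC driver
```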
-
However... what I see there is that you have multiple MPI processes running correctly (n_nodes is what is reported by `check_mpi`). By the way, I'd suggest setting at most 2 CPUs per task for optimal performance, not more.
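If it helps to double-check what each process sees, here is a quick snippet (a sketch printing the same information `check_mpi` reports):

```python
import netket as nk

# each MPI process prints its own rank and the total number of processes
print(f"rank {nk.utils.mpi.rank} of {nk.utils.mpi.n_nodes}")
```

Running it with `mpirun -np 2 python script.py` should print two distinct ranks.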
-
Thanks for your help! Indeed, by adding your flag I no longer get multiple .txt files, so hopefully MPI is working. However, I am not seeing a significant increase in computation speed. As a check, I tried the Jastrow part of your tutorial code https://www.netket.org/tutorials/Heisenberg1d.html (which you apparently ran in 21.09753 seconds). With the same parameters, setting --ntasks=3 and --cpus-per-task=2, I get 48 seconds. That seems quite slow, doesn't it?

In reply to your questions: I actually set ntasks=2, not 3 as I wrote (my mistake). However, I realized that sometimes...
-
That is normal. You have to tune the parameters a bit... Read this in particular. If you do this under MPI with 2 nodes, the chains will be split between the ranks. You do speed up the other computations (the gradient, SR if you use it), but sampling (which is a bottleneck in NetKet) will be slower.

I'm rather confident that the number of available CPUs is what JAX ends up using. Can you tell me in which cases the number of available CPUs reported differs from what you set with mpirun? And what is the structure of your cluster (how many CPUs per node...)? Also, you can try to use...
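To make the trade-off concrete, a sketch of where the relevant knobs live (assuming a recent NetKet 3.x; the values are illustrative, not a recommendation):

```python
import netket as nk

hi = nk.hilbert.Spin(s=1/2, N=16)

# the 16 chains are split across MPI ranks (8 per rank with 2 ranks);
# fewer samples per chain means relatively more time spent on burn-in
sa = nk.sampler.MetropolisLocal(hi, n_chains=16)
vs = nk.vqs.MCState(sa, nk.models.RBM(alpha=1), n_samples=1000)
```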
-
I don't clearly understand your calculation. When you...

Ok, I will check this... Just another question: if I want to run my optimization with different values of the Hamiltonian parameters, what do I need to do? P.S. Thank you again!
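Not from the thread, but a minimal sketch of one way such a sweep could look (assuming NetKet 3.x; the graph, model, and coupling values are illustrative):

```python
import netket as nk

g = nk.graph.Chain(length=16)
hi = nk.hilbert.Spin(s=1/2, N=g.n_nodes)

for J in [0.5, 1.0, 2.0]:
    # rebuild the Hamiltonian and rerun the optimization for each coupling
    ha = nk.operator.Heisenberg(hilbert=hi, graph=g, J=J)
    sa = nk.sampler.MetropolisExchange(hi, graph=g)
    vs = nk.vqs.MCState(sa, nk.models.RBM(alpha=1), n_samples=1008)
    opt = nk.optimizer.Sgd(learning_rate=0.05)
    gs = nk.driver.VMC(ha, opt, variational_state=vs)
    gs.run(n_iter=300, out=nk.logging.JsonLog(f"heisenberg_J{J}"))
```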
-
Yes, but then the cost is 225 vs. 163, which is only about a 28% improvement.

Anything is good.
-
I am also having some trouble when running with MPI. I am trying to get the 1D Heisenberg example to run with `-np > 2`, which always crashes with some LLVM error.

Submitting a script with the NetKet testing tool:

will give me:

This is weird, because the nodes should have 40 CPUs. When I do `mpirun -np 3` (or more) `python3 -m netket.tools.check_mpi`, the output correctly says...

I am also confused by `n_nodes: 2`, as `-np` should specify the number of processes I run in parallel, right? I am running NetKet version 3.0.4. Any help is appreciated.
-
Many thanks for your reply. However, by including `export OMP_NUM_THREADS=1` in the submit script, I can somehow increase `-np` up to 5 before I face the LLVM error once again. If I understood correctly, this environment variable limits the number of threads per process, right? So I am hitting something like a 5-thread limit (possibly per job), which does not seem very reasonable.

3) Virtual memory limit: the output of `ulimit -a | grep virtual` on the command line (and also within the submit script) was "unlimited", so this is probably fine.

Is there something else I could try?
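A possible variant to try (my suggestion, not from the thread): the same limit can also be set from inside the Python script, as long as it happens before JAX/NetKet are imported:

```python
import os

# must be set before importing jax/netket, otherwise the OpenMP/XLA
# runtimes may already have picked their thread count
os.environ["OMP_NUM_THREADS"] = "1"

import netket as nk  # noqa: E402
```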
-
How is NetKet actually running multiple chains per task? I am referring to the `n_chains_per_task` option. Is that something based on MPI, or is it due to JAX/XLA? I mean, NetKet 2 already had this `n_chains` feature without JAX. Thanks!
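As far as I understand (my reading of the docs, not NetKet source): within one MPI task the chains are stored as a single batch and advanced by one vectorized, jitted JAX computation, so no extra processes or threads are needed for them; MPI only enters to split chains across tasks. A toy illustration of the batched-update idea:

```python
import jax
import jax.numpy as jnp

def metropolis_step(key, states):
    # toy single-spin-flip update applied to all chains at once;
    # a real sampler would also compute an acceptance probability
    n_chains, n_sites = states.shape
    sites = jax.random.randint(key, (n_chains,), 0, n_sites)
    return states.at[jnp.arange(n_chains), sites].multiply(-1)

states = jnp.ones((8, 16))   # 8 chains of 16 spins, one array
states = jax.jit(metropolis_step)(jax.random.PRNGKey(0), states)
```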
-
Hello! I'm running some simulations on a cluster that uses SLURM as a job scheduling system. My NetKet version is 3.0b1.post9. In my code, I optimize a custom model defined with the Flax Linen API and then write some data (values of the energy, etc.) to a .txt file. To speed up the computation, I'm trying to use MPI parallelization. For this purpose, I set ntasks>1 in my job script and run with `mpirun python my_programm.py`. I realized that every time I run the code, the program produces several .txt files instead of one (as expected), so I presume MPI is not working properly. However, if I run `mpirun -np 2 python3 -m netket.tools.check_mpi`, as recommended at https://www.netket.org/docs/sharp-bits.html#parallelization, I get a reasonable result, for example (setting ntasks=3 and cpus-per-task=10):

As you can see, my OpenMPI version is 3.1.4 (since no more recent version is available on the cluster). Do you have any idea of what is happening?