
Merge pull request #502 from zhenghh04/main
Updated frameworks documentation pages
felker authored Nov 8, 2024
2 parents 959855a + af8bc01 commit 071218e
Showing 8 changed files with 156 additions and 77 deletions.
21 changes: 10 additions & 11 deletions docs/aurora/data-science/frameworks/oneCCL.md
@@ -18,16 +18,12 @@ kaushikvelusamy@aurora-uan-0012:~> module load frameworks
/opt/aurora/24.180.0/CNDA/oneapi/ccl/2021.13.1_20240808.145507
```


--8<-- [start:onecclenv]
**oneCCL mandatory environment variables**

```bash
module load frameworks
echo $CCL_ROOT
export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH
The parameters below are recommended to be set at all times, as they either give the best performance for all applications or are required to address potential hangs or crashes at large scale.

```bash
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
@@ -41,9 +37,15 @@ export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
```

**oneCCL optional environment variables**
The impact of the following environment variables might be application dependent. Users are encouraged to try setting them and see whether they help their applications.

```bash
ulimit -c unlimited
Expand All @@ -53,17 +55,14 @@ export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve the synchronous-send issue that causes a Horovod segfault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

--8<-- [end:onecclenv]
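
These variables are consumed by oneCCL when a distributed backend starts up. As a minimal, hedged sketch of how the CCL backend is typically selected from PyTorch (assuming the `oneccl_bindings_for_pytorch` package provided by the frameworks module; the rank, world size, and rendezvous settings below are single-process placeholders):

```python
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend with torch.distributed

# Placeholder rendezvous settings; in a real job these come from the MPI/PALS environment.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl", rank=0, world_size=1)
print("Distributed backend:", dist.get_backend())
```

In a real job the rank, world size, and master address would be derived from the launcher environment rather than hard-coded.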

**Algorithm selection**

59 changes: 40 additions & 19 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -12,15 +12,15 @@ the frameworks module. To use it from a compute node, please load the following

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```
Then you can `import` PyTorch as usual; the following is example output from the
`frameworks/2023.12.15.001` module
`frameworks` module

```
>>> import torch
>>> torch.__version__
'2.0.1a0+cxx11.abi'
'2.3.1+cxx11.abi'
```
A simple but useful check could be to use PyTorch to get device information on
a compute node. You can do this the following way:
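
For instance, a minimal sketch of such a check (assuming the `intel_extension_for_pytorch` package that ships with the `frameworks` module; the exact output depends on the node) could be:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  registers the "xpu" device with PyTorch

# Report the XPU (GPU) devices PyTorch can see on this compute node
print("XPU available:", torch.xpu.is_available())
print("Device count :", torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"  [{i}] {torch.xpu.get_device_name(i)}")
```
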
@@ -128,22 +128,12 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission script (PBSPro script) when attempting to run on more than 16 nodes.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20
--8<-- [start:commononecclenv]
#### oneCCL environment variables
--8<-- "./docs/aurora/data-science/frameworks/oneCCL.md:onecclenv"

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600
```
These environment variable settings will probably be included in the frameworks module in the future, but for now users need to set them explicitly in the submission script.
--8<-- [end:commononecclenv]
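
If it is more convenient to keep these settings next to the training code rather than in the PBS script, one possible alternative (an untested sketch, reusing values from the list above) is to export them from Python before the distributed backend is initialized:

```python
import os

# These must be in the environment before oneCCL is initialized,
# i.e. before torch.distributed.init_process_group(backend="ccl") is called.
for name, value in {
    "CCL_PROCESS_LAUNCHER": "pmix",
    "CCL_ATL_TRANSPORT": "mpi",
    "CCL_KVS_MODE": "mpi",
    "CCL_KVS_CONNECTION_TIMEOUT": "600",
}.items():
    os.environ.setdefault(name, value)
```
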

In order to run an application with the `TF32` precision type, one must set the following environment variable:
@@ -314,7 +304,7 @@ export IPEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -333,6 +323,37 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
57 changes: 37 additions & 20 deletions docs/aurora/data-science/frameworks/tensorflow.md
@@ -13,17 +13,19 @@ module. To use it from a compute node, please do:

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```

Then you can `import` TensorFlow as usual; the following is example output from the
`frameworks/2023.12.15.001` module:
`frameworks` module:

```
>>> import tensorflow as tf
>>> tf.__version__
'2.14.1'
```
This import will fail on login nodes because there are no XPUs on login nodes.

A simple but useful check could be to use TensorFlow to get device information
on a compute node. You can do this the following way:

@@ -199,22 +201,7 @@ Some Aurora specific details might be helpful to you.
The following environment variables should be set in the batch submission script (PBSPro script) when attempting to run on more than 16 nodes.

```bash
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600
```
--8<-- "./docs/aurora/data-science/frameworks/pytorch.md:commononecclenv"

### CPU Affinity

@@ -309,7 +296,6 @@ export FI_LOG_PROV=cxi
# These allow for logging from a specific provider (libfabric)

export MPIR_CVAR_ENABLE_GPU=0
export CCL_KVS_GET_TIMEOUT=600

#####################################################################
# FRAMEWORK Variables that make a performance difference
@@ -327,7 +313,7 @@ export ITEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -338,6 +324,36 @@ export NUMEXPR_NUM_THREADS=64
# or equal to '64' or to increase the 'NUMEXPR_MAX_THREADS' to the available
# number of threads. Both of these variables can be set manually.


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

#####################################################################
# End of environment setup section
#####################################################################
@@ -346,6 +362,7 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
39 changes: 39 additions & 0 deletions docs/polaris/applications-and-libraries/libraries/nccl.md
@@ -0,0 +1,39 @@
# NCCL

NVIDIA NCCL (pronounced "Nickel") is a standalone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

NCCL is a key library for scaling AI applications on NVIDIA systems. The conda modules on Polaris are built with NCCL as the communication backend for distributed training, but HPC applications can also choose NCCL for communication instead of MPI. The library is available in the following folder: ```/soft/libraries/nccl```.
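
As a hedged illustration, one can confirm from Python that the PyTorch build in the `conda` module ships an NCCL backend and see which NCCL version it will use (assuming a recent PyTorch with the standard `torch.cuda.nccl` helpers):

```python
import torch
import torch.distributed as dist

# Check that this PyTorch build was compiled against NCCL and report its version
print("NCCL backend available:", dist.is_nccl_available())
print("NCCL version          :", torch.cuda.nccl.version())
```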

--8<-- [start:ncclenv]
We have done extensive performance tests and identified the following best environment setup.

```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
The key here is to enable the AWS plugin (https://github.com/aws/aws-ofi-nccl). AWS OFI NCCL is a plugin that enables EC2 developers to use libfabric as a network provider while running NVIDIA's NCCL-based applications.

This setup leads to a 2-3x performance improvement. For details, please refer to: https://github.com/argonne-lcf/alcf-nccl-tests.

As of now (October 29, 2024), these environment variable settings have been included by default in the `conda` modules on Polaris. A user can confirm this via:
```bash
module load conda
env | grep NCCL
env | grep FI
```

!!! warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause hangs or NCCL timeout issues. If so, please disable it by:
```bash
unset NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC NCCL_COLLNET_ENABLE NCCL_NET
```
--8<-- [end:ncclenv]


6 changes: 3 additions & 3 deletions docs/polaris/data-science-workflows/frameworks/jax.md
@@ -9,16 +9,16 @@ JAX is installed on Polaris via the `conda` module, available with:
module load conda; conda activate
```

Then, you can load JAX in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load JAX in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import jax
>>> jax.__version__
'0.3.15'
'0.4.26'
>>>
```

## Notes on JAX 0.3.15
## Notes on JAX 0.4.26

On Polaris, due to a bug, an environment variable must be set to use JAX on GPUs. The following code will crash:
```python
32 changes: 20 additions & 12 deletions docs/polaris/data-science-workflows/frameworks/pytorch.md
@@ -12,20 +12,20 @@ module load conda
conda activate
```

Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import torch
>>> torch.__version__
'1.12.0a0+git67ece03'
'2.3.0'
>>>
```

This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2022-07-19` module):
This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2024-04-29` module):

```bash
$ echo $CUDA_HOME
/soft/datascience/cuda/cuda_11.5.2_495.29.05_linux
/soft/compilers/cudatoolkit/cuda-12.4.1/
```

If you need to build applications that use this version of PyTorch and CUDA, we recommend using these cuda libraries to ensure compatibility. We periodically update the PyTorch release, though updates will come in the form of new versions of the `conda` module.
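
As a quick sanity check (a sketch assuming the standard `torch.utils.cpp_extension` helpers), one can compare the CUDA version PyTorch was built with against the toolkit pointed to by `CUDA_HOME`:

```python
import os
import torch
from torch.utils import cpp_extension

# CUDA runtime version this PyTorch build was compiled against
print("PyTorch built with CUDA:", torch.version.cuda)
# Toolkit that torch.utils.cpp_extension will use to build extensions
print("cpp_extension CUDA_HOME:", cpp_extension.CUDA_HOME)
print("$CUDA_HOME             :", os.environ.get("CUDA_HOME"))
```
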
@@ -47,20 +47,28 @@ When running PyTorch applications, we have found the following practices to be g

PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the [Horovod documentation](https://horovod.readthedocs.io/en/stable/pytorch.html) or the [Distributed Data Parallel documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). Some Polaris-specific details that may be helpful to you:

1. CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
- Set the environment variable `NCCL_COLLNET_ENABLE=1`
- Set the environment variable `NCCL_NET_GDR_LEVEL=PHB`
- Manually set the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24`
--8<-- [start:scalingsetup]
1. CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24` or `--cpu-bind depth -d 16`.

2. NCCL settings:
--8<-- "./docs/polaris/applications-and-libraries/libraries/nccl.md:ncclenv"

3. CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
--8<-- [end:scalingsetup]

2. Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
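
A minimal sketch of that ordering (a hypothetical standalone script; the device index and print statements are illustrative only) might look like:

```python
import os

# Pin this rank to a single GPU *before* MPI is initialized
# (importing mpi4py below is what triggers MPI initialization).
local_rank = os.environ.get("PMI_LOCAL_RANK", "0")
os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

from mpi4py import MPI  # MPI is initialized here
import torch

comm = MPI.COMM_WORLD
device = torch.device("cuda:0")  # only one device is visible to this rank
print(f"rank {comm.Get_rank()}/{comm.Get_size()} is bound to physical GPU {local_rank}")
```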

### DeepSpeed

DeepSpeed is also available and usable on Polaris. For more information, please see the [DeepSpeed](./deepspeed.md) documentation directly.

## PyTorch `DataLoader` and multi-node Horovod

Please note there is a bug that causes a hang when using PyTorch's multithreaded data loaders with distributed training across multiple nodes. To work around this, NVIDIA recommends setting `num_workers=0` in the dataloader configuration, which serializes data loading.
For best performance, it is crucial to enable multiple workers in the data loader so that computation overlaps with I/O and the dataset is loaded concurrently. This is controlled by the `num_workers` parameter of `DataLoader` (see https://pytorch.org/docs/stable/data.html). In our experience, 4 or 8 workers generally gives the best performance. Given the total number of CPU cores available on a node, the maximum number of workers one can choose is 16. It is always worth tuning this value to find the optimal setup for your own application.
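
For illustration, a hedged sketch of a loader configured along these lines (the dataset and batch size are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; replace with your own
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # 4-8 is usually a good starting point; keep <= 16
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # avoid re-forking workers every epoch
)
```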

Aside from this, one also has to make sure that the worker threads are spread over different CPU cores. To do this, specify the CPU binding to be `depth` and choose a depth value larger than `num_workers` via the following flags in the `mpiexec` command:

```
mpiexec -np $NUM_GPUS -ppn 4 --cpu-bind depth -d 16 python3 ...
```

For more details, see [Polaris Known Issues](../../known-issues.md).
Before 2024, enabling multiple workers caused a fatal hang; this was addressed by an OS upgrade on Polaris.

