
Merge pull request #502 from zhenghh04/main
Updated frameworks documentation pages
felker authored Nov 8, 2024
2 parents 959855a + af8bc01 commit 071218e
Showing 8 changed files with 156 additions and 77 deletions.
21 changes: 10 additions & 11 deletions docs/aurora/data-science/frameworks/oneCCL.md
@@ -18,16 +18,12 @@ kaushikvelusamy@aurora-uan-0012:~> module load frameworks
/opt/aurora/24.180.0/CNDA/oneapi/ccl/2021.13.1_20240808.145507
```


--8<-- [start:onecclenv]
**oneCCL mandatory environment variables**

```bash
module load frameworks
echo $CCL_ROOT
export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH
The parameters below are recommended to be set at all times, as they either give the best performance for all applications or are required to address potential hangs or crashes at large scale.

```bash
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
@@ -41,9 +37,15 @@ export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
```

**oneCCL optional environment variables**
The impact of the following environment variables might be application dependent. Users are encouraged to try setting them and see whether they help their applications.

```bash
ulimit -c unlimited
Expand All @@ -53,17 +55,14 @@ export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve the synchronous-send issue that causes a Horovod segfault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

--8<-- [end:onecclenv]
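
These variables are consumed by oneCCL when a distributed backend starts up. As a minimal, hedged sketch of how the CCL backend is typically selected from PyTorch (assuming the `oneccl_bindings_for_pytorch` package provided by the frameworks module; the rank, world size, and rendezvous settings below are single-process placeholders):

```python
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend with torch.distributed

# Placeholder rendezvous settings; in a real job these come from the MPI/PALS environment.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl", rank=0, world_size=1)
print("Distributed backend:", dist.get_backend())
```

In a real job the rank, world size, and master address would be derived from the launcher environment rather than hard-coded.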

**Algorithm selection**

59 changes: 40 additions & 19 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -12,15 +12,15 @@ the frameworks module. To use it from a compute node, please load the following

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```
Then you can `import` PyTorch as usual; the following is example output from the
`frameworks/2023.12.15.001` module
`frameworks` module

```
>>> import torch
>>> torch.__version__
'2.0.1a0+cxx11.abi'
'2.3.1+cxx11.abi'
```
A simple but useful check could be to use PyTorch to get device information on
a compute node. You can do this the following way:
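
For instance, a minimal sketch of such a check (assuming the `intel_extension_for_pytorch` package that ships with the `frameworks` module; the exact output depends on the node) could be:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  registers the "xpu" device with PyTorch

# Report the XPU (GPU) devices PyTorch can see on this compute node
print("XPU available:", torch.xpu.is_available())
print("Device count :", torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"  [{i}] {torch.xpu.get_device_name(i)}")
```
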
@@ -128,22 +128,12 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission script (PBSPro script) when attempting to run on more than 16 nodes.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20
--8<-- [start:commononecclenv]
#### oneCCL environment variables
--8<-- "./docs/aurora/data-science/frameworks/oneCCL.md:onecclenv"

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600
```
These environment variable settings will probably be included in the frameworks module in the future, but for now users need to set them explicitly in the submission script.
--8<-- [end:commononecclenv]
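
If it is more convenient to keep these settings next to the training code rather than in the PBS script, one possible alternative (an untested sketch, reusing values from the list above) is to export them from Python before the distributed backend is initialized:

```python
import os

# These must be in the environment before oneCCL is initialized,
# i.e. before torch.distributed.init_process_group(backend="ccl") is called.
for name, value in {
    "CCL_PROCESS_LAUNCHER": "pmix",
    "CCL_ATL_TRANSPORT": "mpi",
    "CCL_KVS_MODE": "mpi",
    "CCL_KVS_CONNECTION_TIMEOUT": "600",
}.items():
    os.environ.setdefault(name, value)
```
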

In order to run an application with the `TF32` precision type, one must set the following environment variable:
@@ -314,7 +304,7 @@ export IPEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -333,6 +323,37 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
57 changes: 37 additions & 20 deletions docs/aurora/data-science/frameworks/tensorflow.md
@@ -13,17 +13,19 @@ module. To use it from a compute node, please do:

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```

Then you can `import` TensorFlow as usual; the following is example output from the
`frameworks/2023.12.15.001` module:
`frameworks` module:

```
>>> import tensorflow as tf
>>> tf.__version__
'2.14.1'
```
This import will fail on login nodes because there are no XPUs on login nodes.

A simple but useful check could be to use TensorFlow to get device information
on a compute node. You can do this the following way:

@@ -199,22 +201,7 @@ Some Aurora specific details might be helpful to you.
The following environment variables should be set in the batch submission script (PBSPro script) when attempting to run on more than 16 nodes.

```bash
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600
```
--8<-- "./docs/aurora/data-science/frameworks/pytorch.md:commononecclenv"

### CPU Affinity

@@ -309,7 +296,6 @@ export FI_LOG_PROV=cxi
# These allow for logging from a specific provider (libfabric)

export MPIR_CVAR_ENABLE_GPU=0
export CCL_KVS_GET_TIMEOUT=600

#####################################################################
# FRAMEWORK Variables that make a performance difference
@@ -327,7 +313,7 @@ export ITEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -338,6 +324,36 @@ export NUMEXPR_NUM_THREADS=64
# or equal to '64' or to increase the 'NUMEXPR_MAX_THREADS' to the available
# number of threads. Both of these variables can be set manually.


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

#####################################################################
# End of environment setup section
#####################################################################
@@ -346,6 +362,7 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
39 changes: 39 additions & 0 deletions docs/polaris/applications-and-libraries/libraries/nccl.md
@@ -0,0 +1,39 @@
# NCCL

NVIDIA NCCL (pronounced "Nickel") is a standalone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

NCCL is a key library for scaling AI applications on NVIDIA systems. The conda modules on Polaris are built with NCCL as the communication backend for distributed training, but HPC applications can also choose NCCL for communication instead of MPI. The library is available in the following folder: ```/soft/libraries/nccl```.
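
As a hedged illustration, one can confirm from Python that the PyTorch build in the `conda` module ships an NCCL backend and see which NCCL version it will use (assuming a recent PyTorch with the standard `torch.cuda.nccl` helpers):

```python
import torch
import torch.distributed as dist

# Check that this PyTorch build was compiled against NCCL and report its version
print("NCCL backend available:", dist.is_nccl_available())
print("NCCL version          :", torch.cuda.nccl.version())
```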

--8<-- [start:ncclenv]
We have done extensive performance tests and identified the following best environment setup.

```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
The key here is to enable the AWS plugin (https://github.com/aws/aws-ofi-nccl). AWS OFI NCCL is a plugin that enables EC2 developers to use libfabric as a network provider while running NVIDIA's NCCL-based applications.

This setup leads to a 2-3x performance improvement. For details, please refer to: https://github.com/argonne-lcf/alcf-nccl-tests.

As of now (October 29, 2024), these environment variable settings have been included by default in the `conda` modules on Polaris. A user can confirm this via:
```bash
module load conda
env | grep NCCL
env | grep FI
```

!!! warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause hangs or NCCL timeout issues. If so, please disable it by:
```bash
unset NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC NCCL_COLLNET_ENABLE NCCL_NET
```
--8<-- [end:ncclenv]


6 changes: 3 additions & 3 deletions docs/polaris/data-science-workflows/frameworks/jax.md
@@ -9,16 +9,16 @@ JAX is installed on Polaris via the `conda` module, available with:
module load conda; conda activate
```

Then, you can load JAX in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load JAX in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import jax
>>> jax.__version__
'0.3.15'
'0.4.26'
>>>
```

## Notes on JAX 0.3.15
## Notes on JAX 0.4.26

On Polaris, due to a bug, an environment variable must be set to use JAX on GPUs. The following code will crash:
```python
32 changes: 20 additions & 12 deletions docs/polaris/data-science-workflows/frameworks/pytorch.md
@@ -12,20 +12,20 @@ module load conda
conda activate
```

Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import torch
>>> torch.__version__
'1.12.0a0+git67ece03'
'2.3.0'
>>>
```

This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2022-07-19` module):
This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2024-04-29` module):

```bash
$ echo $CUDA_HOME
/soft/datascience/cuda/cuda_11.5.2_495.29.05_linux
/soft/compilers/cudatoolkit/cuda-12.4.1/
```

If you need to build applications that use this version of PyTorch and CUDA, we recommend using these cuda libraries to ensure compatibility. We periodically update the PyTorch release, though updates will come in the form of new versions of the `conda` module.
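
As a quick sanity check (a sketch assuming the standard `torch.utils.cpp_extension` helpers), one can compare the CUDA version PyTorch was built with against the toolkit pointed to by `CUDA_HOME`:

```python
import os
import torch
from torch.utils import cpp_extension

# CUDA runtime version this PyTorch build was compiled against
print("PyTorch built with CUDA:", torch.version.cuda)
# Toolkit that torch.utils.cpp_extension will use to build extensions
print("cpp_extension CUDA_HOME:", cpp_extension.CUDA_HOME)
print("$CUDA_HOME             :", os.environ.get("CUDA_HOME"))
```
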
@@ -47,20 +47,28 @@ When running PyTorch applications, we have found the following practices to be g

PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the [Horovod documentation](https://horovod.readthedocs.io/en/stable/pytorch.html) or the [Distributed Data Parallel documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). Some Polaris-specific details that may be helpful to you:

1. CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
- Set the environment variable `NCCL_COLLNET_ENABLE=1`
- Set the environment variable `NCCL_NET_GDR_LEVEL=PHB`
- Manually set the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24`
--8<-- [start:scalingsetup]
1. CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24` or `--cpu-bind depth -d 16`.

2. NCCL settings:
--8<-- "./docs/polaris/applications-and-libraries/libraries/nccl.md:ncclenv"

3. CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
--8<-- [end:scalingsetup]

2. Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
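
A minimal sketch of that ordering (a hypothetical standalone script; the device index and print statements are illustrative only) might look like:

```python
import os

# Pin this rank to a single GPU *before* MPI is initialized
# (importing mpi4py below is what triggers MPI initialization).
local_rank = os.environ.get("PMI_LOCAL_RANK", "0")
os.environ["CUDA_VISIBLE_DEVICES"] = local_rank

from mpi4py import MPI  # MPI is initialized here
import torch

comm = MPI.COMM_WORLD
device = torch.device("cuda:0")  # only one device is visible to this rank
print(f"rank {comm.Get_rank()}/{comm.Get_size()} is bound to physical GPU {local_rank}")
```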

### DeepSpeed

DeepSpeed is also available and usable on Polaris. For more information, please see the [DeepSpeed](./deepspeed.md) documentation directly.

## PyTorch `DataLoader` and multi-node Horovod

Please note there is a bug that causes a hang when using PyTorch's multithreaded data loaders with distributed training across multiple nodes. To work around this, NVIDIA recommends setting `num_workers=0` in the dataloader configuration, which serializes data loading.
For best performance, it is crucial to enable multiple workers in the data loader so that computation overlaps with I/O and the dataset is loaded concurrently. This is controlled by the `num_workers` parameter of `DataLoader` (see https://pytorch.org/docs/stable/data.html). In our experience, 4 or 8 workers generally gives the best performance. Given the total number of CPU cores available on a node, the maximum number of workers one can choose is 16. It is always worth tuning this value to find the optimal setup for your own application.
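
For illustration, a hedged sketch of a loader configured along these lines (the dataset and batch size are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; replace with your own
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # 4-8 is usually a good starting point; keep <= 16
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # avoid re-forking workers every epoch
)
```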

Aside from this, one also has to make sure that the worker threads are spread over different CPU cores. To do this, specify the CPU binding to be `depth` and choose a depth value larger than `num_workers` via the following flags in the `mpiexec` command:

```
mpiexec -np $NUM_GPUS -ppn 4 --cpu-bind depth -d 16 python3 ...
```

For more details, see [Polaris Known Issues](../../known-issues.md).
Before 2024, enabling multiple workers caused a fatal hang; this was addressed by an OS upgrade on Polaris.

