Merge pull request #537 from argonne-lcf/hotfix/snippets-and-nav
Fix capitalization in nav sidebar and fix pymdownx.snippets section magic left in rendered text
felker authored Nov 8, 2024
2 parents 071218e + 6f7d659 commit a680538
Showing 6 changed files with 25 additions and 25 deletions.
4 changes: 2 additions & 2 deletions docs/aurora/data-science/frameworks/oneCCL.md
@@ -18,7 +18,7 @@ kaushikvelusamy@aurora-uan-0012:~> module load frameworks
/opt/aurora/24.180.0/CNDA/oneapi/ccl/2021.13.1_20240808.145507
```

--8<-- [start:onecclenv]
<!-- --8<-- [start:onecclenv] -->
**OneCCL mandatory environment variables**

The parameters below are recommended to be set at all times, as they either give the best performance for all applications or are required to address potential hangs or crashes at large scale.
@@ -62,7 +62,7 @@ export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 #to solve the sync send issue
export CCL_ATL_SYNC_COLL=1 #to avoid potential hang at large scale
export CCL_OP_SYNC=1 #to avoid potential hang at large scale
```
--8<-- [end:onecclenv]
<!-- --8<-- [end:onecclenv] -->

**Algorithm selection**

4 changes: 2 additions & 2 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -128,12 +128,12 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission
script (PBSPro script) when attempting to run beyond 16 nodes; a sketch of such a script follows the snippet below.

--8<-- [start:commononecclenv]
<!-- --8<-- [start:commononecclenv] -->
#### oneCCL environment variable
--8<-- "./docs/aurora/data-science/frameworks/oneCCL.md:onecclenv"

These environment variable settings will probably be included in the framework module file in the future. But for now, users need to explicitly set these in the submission script.
--8<-- [end:commononecclenv]
<!-- --8<-- [end:commononecclenv] -->
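
For illustration only, a minimal PBSPro submission-script sketch that sets these variables before launching the application might look like the following; the project, queue, node count, rank-count variables, and `train.py` are placeholders rather than values from this repository:

```bash
#!/bin/bash -l
#PBS -A <project>
#PBS -l select=32
#PBS -l walltime=00:30:00
#PBS -q <queue>
# Hypothetical PBSPro fragment for a >16-node run; directives and the launch
# line are placeholders, not verified values.

module load frameworks

# oneCCL settings carried over from the oneCCL.md snippet above (subset shown).
export CCL_ATL_SYNC_COLL=1                          # to avoid potential hang at large scale
export CCL_OP_SYNC=1                                # to avoid potential hang at large scale
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1   # to solve the sync send issue at large scale

mpiexec -n "${NTOTRANKS}" --ppn "${NRANKS_PER_NODE}" python train.py
```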

To run an application with the `TF32` precision type, one must set the
following environment variable:
22 changes: 11 additions & 11 deletions docs/polaris/applications-and-libraries/libraries/nccl.md
@@ -2,9 +2,9 @@

NVIDIA NCCL (pronounced "Nickel") is a standalone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

NCCL is a key library for scaling AI applications on Nvidia system. The conda module on Polaris are built with NCCL as the communication backend for distributed training. But HPC applications can also chose NCCL for communication over MPI. The library is available in the following folder: ```/soft/libraries/nccl```.
NCCL is a key library for scaling AI applications on NVIDIA systems. The Anaconda modules on Polaris are built with NCCL as the communication backend for distributed training of deep learning models, but HPC applications can also choose NCCL over MPI for communication. The library is available in the following folder: ```/soft/libraries/nccl```.

--8<-- [start:ncclenv]
<!-- --8<-- [start:ncclenv] -->
We have done extensive performance tests and identified the following best environment setup.

```bash
@@ -18,22 +18,22 @@ export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
The key here is to enable AWS plugin (https://github.com/aws/aws-ofi-nccl). AWS OFI NCCL is a plug-in which enables EC2 developers to use libfabric as a network provider while running NVIDIA's NCCL based applications.
The key here is to enable the AWS plugin (https://github.com/aws/aws-ofi-nccl). AWS OFI NCCL is a plugin that enables EC2 developers to use libfabric as a network provider while running NVIDIA's NCCL-based applications.

This setup will lead to 2-3x performance improvement. For details, please refer to: https://github.com/argonne-lcf/alcf-nccl-tests.
This setup can lead to a 2-3x performance improvement for some communication workloads. For details, please refer to: https://github.com/argonne-lcf/alcf-nccl-tests.

As of now (October 29, 2024), these environment variable settings have been included by default in the `conda` modules on Polaris. A user can confirm this via:
```bash
module load conda
env | grep NCCL
env | grep FI
```
<!-- As of now (October 29, 2024), these environment variable settings have been included by default in the `conda` modules on Polaris. A user can confirm this via: -->
<!-- ```bash -->
<!-- module load conda -->
<!-- env | grep NCCL -->
<!-- env | grep FI -->
<!-- ``` -->

!!! warning
For some applications, such as Megatron-DeepSpeed, enabling the AWS plugin can cause a hang or an NCCL timeout issue. If so, please disable it with:
```bash
unset NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC NCCL_COLLNET_ENABLE NCCL_NET
```
--8<-- [end:ncclenv]
<!-- --8<-- [end:ncclenv] -->


6 changes: 3 additions & 3 deletions docs/polaris/containers/containers.md
@@ -99,7 +99,7 @@ The job can be submitted using:
qsub -v CONTAINER=mpich-4_latest.sif job_submission.sh
```

--8<-- [start:commoncontainerdoc]
<!-- --8<-- [start:commoncontainerdoc] -->

## Recipe-Based Container Building

@@ -136,7 +136,7 @@ export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir/
mkdir $APPTAINER_CACHEDIR
```

* Make sure you are not on a directory accessed with a symlink, i.e. check if `pwd` and `pwd -P` returns the same path.
* Make sure you are not in a directory accessed through a symbolic link, i.e., check whether `pwd` and `pwd -P` return the same path (a short check is sketched after this list).

* If any of the above doesn't work, try running the build in your home directory.
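
For reference, one way to perform the symbolic-link check mentioned in the list above is a small shell test; this is a sketch, not taken from the repository:

```bash
# Compare the logical and physical working directories; a mismatch means the
# current directory is reached through a symbolic link.
if [ "$(pwd)" != "$(pwd -P)" ]; then
    echo "Symlinked path detected; consider building from $(pwd -P) instead."
fi
```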

@@ -146,4 +146,4 @@

**Disabled port mapping, user namespace, and network virtualization.** [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for the container due to security constraints. See issue [#2533](https://github.com/apptainer/apptainer/issues/2553).

--8<-- [end:commoncontainerdoc]
<!-- --8<-- [end:commoncontainerdoc] -->
6 changes: 3 additions & 3 deletions docs/polaris/data-science-workflows/frameworks/pytorch.md
@@ -38,7 +38,7 @@ PyTorch is also available through NVIDIA containers that have been translated to

When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.

1. Use Reduced Precision. Reduced Precision is available on A100 via tensorcores and is supported with PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision package (AMP), as descibed in the [mixed precision documentation](https://pytorch.org/docs/stable/amp.html). In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators can provide easy tools to do this.
1. Use Reduced Precision. Reduced Precision is available on A100 via tensorcores and is supported with PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision package (AMP), as described in the [mixed precision documentation](https://pytorch.org/docs/stable/amp.html). In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators can provide easy tools to do this.

2. PyTorch has a `JIT` module as well as backends to support op fusion, similar to TensorFlow's `tf.function` tools. However, PyTorch JIT capabilities are newer and may not yield performance improvements. Please see [TorchScript](https://pytorch.org/docs/stable/jit.html) for more information.

@@ -47,14 +47,14 @@ When running PyTorch applications, we have found the following practices to be g

PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the [Horovod documentation](https://horovod.readthedocs.io/en/stable/pytorch.html) or the [Distributed Data Parallel documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). Some Polaris-specific details that may be helpful to you:

--8<-- [start:scalingsetup]
<!-- --8<-- [start:scalingsetup] -->
1. CPU affinity can improve performance, particularly for the data loading process. In particular, we encourage users to try their scaling measurements by manually setting the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24` or `--cpu-bind depth -d 16`.

2. NCCL settings:
--8<-- "./docs/polaris/applications-and-libraries/libraries/nccl.md:ncclenv"

3. CUDA device setting: it works best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks. A short wrapper sketch follows this list.
--8<-- [end:scalingsetup]
<!-- --8<-- [end:scalingsetup] -->
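
As a minimal sketch of the per-rank GPU pinning described in item 3 above, assuming a hypothetical wrapper script named `set_affinity_gpu.sh` (the name and launch values are illustrative, not from the repository):

```bash
#!/bin/bash
# Hypothetical per-rank wrapper: pin this rank to a single GPU using the
# launcher-provided PMI_LOCAL_RANK, before the Python process (and any
# mpi4py/horovod import that triggers MPI_Init) starts.
export CUDA_VISIBLE_DEVICES="${PMI_LOCAL_RANK}"
exec "$@"
```

It could then be combined with the CPU-binding suggestion from item 1, e.g. `mpiexec -n "${NRANKS}" --ppn 4 --cpu-bind verbose,list:0,8,16,24 ./set_affinity_gpu.sh python train.py` (placeholder values).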


### DeepSpeed
8 changes: 4 additions & 4 deletions mkdocs.yml
@@ -115,7 +115,7 @@ nav:
- Data Science:
- Python: sophia/data-science/python.md
- Fine-tuning with Autotrain: sophia/data-science/fine-tune-LLM-with-Autotrain.md
- Visualization:
- Visualization:
- Visualization on Sophia: sophia/visualization/visualization.md
- ParaView (Launch from Client): sophia/visualization/paraview.md
- AI Testbed:
@@ -185,7 +185,7 @@ nav:
- Copper: aurora/data-management/copper/copper.md
- DAOS: aurora/data-management/daos/daos-overview.md
- Lustre (Flare): aurora/data-management/lustre/flare.md
- Moving data to Aurora:
- Moving data to Aurora:
- DAOS data mover: aurora/data-management/moving_data_to_aurora/daos_datamover.md
- Globus: aurora/data-management/moving_data_to_aurora/globus.md
- SCP: aurora/data-management/moving_data_to_aurora/scp.md
@@ -209,13 +209,13 @@
- PyTorch: aurora/data-science/frameworks/pytorch.md
- TensorFlow: aurora/data-science/frameworks/tensorflow.md
- LibTorch: aurora/data-science/frameworks/libtorch.md
- OneCCL: aurora/data-science/frameworks/oneCCL.md
- oneCCL: aurora/data-science/frameworks/oneCCL.md
- Libraries:
- OpenVINO: aurora/data-science/libraries/openvino.md
- Programming Models:
- Kokkos: aurora/programming-models/kokkos-aurora.md
- Level Zero: aurora/programming-models/level-0.md
- openCL: aurora/programming-models/opencl-aurora.md
- OpenCL: aurora/programming-models/opencl-aurora.md
- OpenMP: aurora/programming-models/openmp-aurora.md
- RAJA: aurora/programming-models/raja-aurora.md
- SYCL: aurora/programming-models/sycl-aurora.md
