Merge pull request #378 from argonne-lcf/feature/Updates-for-Cerebras-r-2.1.1

Feature/updates for cerebras r 2.1.1
vksastry authored Mar 21, 2024
2 parents f8a7070 + 2a330bf commit 92e95f4
Showing 3 changed files with 36 additions and 33 deletions.
10 changes: 5 additions & 5 deletions docs/ai-testbed/cerebras/customizing-environment.md
@@ -7,16 +7,16 @@
```console
#Make your home directory navigable
chmod a+xr ~/
mkdir ~/R_2.0.3
chmod a+x ~/R_2.0.3/
cd ~/R_2.0.3
mkdir ~/R_2.1.1
chmod a+x ~/R_2.1.1/
cd ~/R_2.1.1
# Note: "deactivate" does not actually work in scripts.
deactivate
rm -r venv_cerebras_pt
/software/cerebras/python3.8/bin/python3.8 -m venv venv_cerebras_pt
source venv_cerebras_pt/bin/activate
pip install --upgrade pip
pip install cerebras_pytorch==2.0.2
pip install cerebras_pytorch==2.1.1
```
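
The release directory and the pinned wheel version each appear many times in these instructions, which is why a version bump like this one touches so many lines. As a minimal sketch, the values could be set once and reused. The variable names `CS_RELEASE_DIR`, `CS_WHEEL_VERSION`, and `CS_VENV` are hypothetical and are not part of the original docs. The two versions are kept separate because the previous revision paired `R_2.0.3` with `cerebras_pytorch==2.0.2`.

```bash
# Hypothetical variable names, chosen for illustration only.
CS_RELEASE_DIR="$HOME/R_2.1.1"      # working directory used throughout the docs
CS_WHEEL_VERSION="2.1.1"            # version pinned in the pip install step
CS_VENV="$CS_RELEASE_DIR/venv_cerebras_pt"

# Print the values that the commands above would use.
echo "release dir: $CS_RELEASE_DIR"
echo "venv:        $CS_VENV"
echo "pip pin:     cerebras_pytorch==$CS_WHEEL_VERSION"
```

With these set, the activation step would read `source "$CS_VENV/bin/activate"`.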

<!--- No longer any TensorFlow wheel
@@ -28,7 +28,7 @@ pip install cerebras_pytorch==2.0.2
To activate a virtual environments

```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
```

To deactivate a virtual environment,
41 changes: 22 additions & 19 deletions docs/ai-testbed/cerebras/example-programs.md
@@ -4,37 +4,39 @@
Make a working directory and a local copy of the Cerebras **modelzoo** and **anl_shared** repository, if not previously done, as follows.

```bash
mkdir ~/R_2.0.3
cd ~/R_2.0.3
mkdir ~/R_2.1.1
cd ~/R_2.1.1
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.0.3
git checkout Release_2.1.1
```
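
The checkout step depends on the release tag existing in the cloned repository, which `git tag` lets you confirm by eye. As a rough sketch, the same check could be scripted before checking out. The helper `tag_exists` and the variable `RELEASE_TAG` are hypothetical names introduced here, not part of the original instructions.

```bash
# Hypothetical helper: succeeds only if the named tag exists in the current repository.
tag_exists() {
    git rev-parse --quiet --verify "refs/tags/$1" >/dev/null 2>&1
}

RELEASE_TAG="Release_2.1.1"
if tag_exists "$RELEASE_TAG"; then
    git checkout "$RELEASE_TAG"
else
    # Also reached when the current directory is not a git repository.
    echo "tag $RELEASE_TAG not found; run 'git tag' to list available releases" >&2
fi
```

Run it from inside the `modelzoo` clone. Outside a repository it prints only the message.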
<!---
cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_2.0.3/anl_shared
cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_2.1.1/anl_shared
--->

<!---
## UNet
An implementation of this: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/pdf/1505.04597.pdf), Ronneberger et. al 2015<br>
To run Unet with the <a href="https://www.kaggle.com/c/severstal-steel-defect-detection">Severstal: Steel Defect Detection</a> kaggle dataset, using a pre-downloaded copy of the dataset:<br>
First, source a Cerebras PyTorch virtual environment and make sure that requirements are installed.
```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.0.3/modelzoo/requirements.txt
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.1.1/modelzoo/requirements.txt
```
Then
```console
cd ~/R_2.0.3/modelzoo/modelzoo/vision/pytorch/unet
cd ~/R_2.1.1/modelzoo/modelzoo/vision/pytorch/unet
cp /software/cerebras/dataset/severstal-steel-defect-detection/params_severstal_binary_rawds.yaml configs/params_severstal_binary_rawds.yaml
export MODEL_DIR=model_dir_unet
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
--->

<!--- Appears to not have been ported to 1.7.1
## BraggNN
@@ -47,7 +49,7 @@ The BraggNN model has two versions:<br>
```console
TODO
cd ~/R_2.0.3/anl_shared/braggnn/tf
cd ~/R_2.1.1/anl_shared/braggnn/tf
# This yaml has a correct path to a BraggNN dataset
cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml
export MODEL_DIR=model_dir_braggnn
@@ -67,20 +69,20 @@ source /software/cerebras/venvs/venv_cerebras_pt/bin/activate
# or your personal venv
--->
```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.0.3/modelzoo/requirements.txt
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.1.1/modelzoo/requirements.txt
```

Then

```console
cd ~/R_2.0.3/modelzoo/modelzoo/transformers/pytorch/bert
cd ~/R_2.1.1/modelzoo/modelzoo/transformers/pytorch/bert
cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
export MODEL_DIR=model_dir_bert_large_pytorch
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.0.3/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`.
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.1.1/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`.

The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown.

@@ -102,7 +104,7 @@ The last parts of the output should resemble the following, with messages about
2023-11-29 20:13:25,691 INFO: Training completed successfully!
2023-11-29 20:13:25,691 INFO: Processed 1024000 sample(s) in 336.373620536 seconds.
```
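
The run commands pipe their output to `mytest.log`, so the result can be checked afterwards. The following is a rough sketch that assumes the final log lines use the same wording as the sample output above. The function names are hypothetical, and the sample lines are inlined so that the snippet runs without a log file.

```bash
# Hypothetical helpers; they assume the log wording shown in the sample output.

# Succeeds if the text on stdin contains the success message.
training_succeeded() {
    grep -q 'Training completed successfully!'
}

# Prints samples per second from a "Processed N sample(s) in T seconds." line on stdin.
samples_per_second() {
    awk '{
        for (i = 1; i + 4 <= NF; i++)
            if ($i == "Processed") { printf "%.1f\n", $(i + 1) / $(i + 4); exit }
    }'
}

sample_log='2023-11-29 20:13:25,691 INFO: Training completed successfully!
2023-11-29 20:13:25,691 INFO: Processed 1024000 sample(s) in 336.373620536 seconds.'

if printf '%s\n' "$sample_log" | training_succeeded; then
    echo "training finished"
fi
rate=$(printf '%s\n' "$sample_log" | samples_per_second)
echo "throughput: $rate samples/s"
```

After a real run, the same functions could read the file directly, for example `training_succeeded < mytest.log`.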

<!---
## GPT-J PyTorch
GPT-J [[github]](https://github.com/kingoflolz/mesh-transformer-jax) is an auto-regressive language model created by [EleutherAI](https://www.eleuther.ai/).
@@ -111,18 +113,18 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 2 CS2s.
First, source a Cerebras PyTorch virtual environment and make sure that the requirements are installed:
```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.0.3/modelzoo/requirements.txt
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.1.1/modelzoo/requirements.txt
```
Then
```console
cd ~/R_2.0.3/modelzoo/modelzoo/transformers/pytorch/gptj
cd ~/R_2.1.1/modelzoo/modelzoo/transformers/pytorch/gptj
cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
export MODEL_DIR=model_dir_gptj
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
The last parts of the output should resemble the following:
@@ -137,3 +139,4 @@ The last parts of the output should resemble the following:
2023-11-29 21:14:30,142 INFO: Training completed successfully!
2023-11-29 21:14:30,142 INFO: Processed 24000 sample(s) in 910.883781998 seconds.
```
--->
18 changes: 9 additions & 9 deletions docs/ai-testbed/cerebras/running-a-model-or-program.md
@@ -25,29 +25,29 @@ Follow these instructions to compile and train the `fc_mnist` PyTorch sample. Th

First, make a virtual environment for Cerebras for PyTorch.
See [Customizing Environments](./customizing-environment.md) for the procedures for making PyTorch virtual environments for Cerebras.
If an environment is made in ```~/R_2.0.3/```, it they would be activated as follows:
If an environment is made in ```~/R_2.1.1/```, it would be activated as follows:
```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
```

### Clone the Cerebras modelzoo

```console
mkdir ~/R_2.0.3
cd ~/R_2.0.3
mkdir ~/R_2.1.1
cd ~/R_2.1.1
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.0.3
git checkout Release_2.1.1
```
## Running a Pytorch sample

### Activate your PyTorch virtual environment, install modelzoo requirements, and change to the working directory

```console
source ~/R_2.0.3/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.0.3/modelzoo/requirements.txt
cd ~/R_2.0.3/modelzoo/modelzoo/fc_mnist/pytorch
source ~/R_2.1.1/venv_cerebras_pt/bin/activate
pip install -r ~/R_2.1.1/modelzoo/requirements.txt
cd ~/R_2.1.1/modelzoo/modelzoo/fc_mnist/pytorch
```

Next, edit configs/params.yaml, making the following changes:
@@ -76,7 +76,7 @@ To run the sample:
export MODEL_DIR=model_dir
# deletion of the model_dir is only needed if sample has been previously run
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=pt_smoketest --params configs/params.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo --compile_dir /$(whoami) |& tee mytest.log
python run.py CSX --job_labels name=pt_smoketest --params configs/params.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo --compile_dir /$(whoami) |& tee mytest.log
```

A successful fc_mnist PyTorch training run should finish with output resembling the following:
