Merge pull request #378 from argonne-lcf/feature/Updates-for-Cerebras-r-2.1.1

Feature/updates for cerebras r 2.1.1
argonne-lcf · Mar 21, 2024 · 92e95f4
2 parents f8a7070 + 2a330bf

Showing 3 changed files with 36 additions and 33 deletions.
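
Every hunk in this commit swaps the same release string (`2.0.3` or `2.0.2` to `2.1.1`) in directory names, the modelzoo tag, and the `cerebras_pytorch` pin. A minimal sketch of how those values could be derived from a single variable is shown here. The variable names (`CS_RELEASE`, `CS_WORKDIR`, `CS_VENV`, `CS_MODELZOO`, `CS_TAG`) are hypothetical and do not appear in the docs; only the directory layout, tag pattern, and package name are taken from the diff. The script only prints the commands (a dry run) and does not touch the system.

```shell
#!/usr/bin/env bash
# Hypothetical helper variables; the layout (~/R_<release>/venv_cerebras_pt,
# ~/R_<release>/modelzoo, tag Release_<release>) mirrors the changed docs.
CS_RELEASE="2.1.1"
CS_WORKDIR="$HOME/R_${CS_RELEASE}"
CS_VENV="$CS_WORKDIR/venv_cerebras_pt"
CS_MODELZOO="$CS_WORKDIR/modelzoo"
CS_TAG="Release_${CS_RELEASE}"

# Dry run: print the release-dependent commands instead of executing them.
echo "mkdir $CS_WORKDIR"
echo "git checkout $CS_TAG"
echo "pip install cerebras_pytorch==$CS_RELEASE"
echo "source $CS_VENV/bin/activate"
echo "pip install -r $CS_MODELZOO/requirements.txt"
```

With this approach a later release bump would touch one assignment rather than each path. Note that the wheel version and the directory name are not guaranteed to match: the previous docs paired `R_2.0.3` with `cerebras_pytorch==2.0.2`.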
diff --git a/docs/ai-testbed/cerebras/customizing-environment.md b/docs/ai-testbed/cerebras/customizing-environment.md
@@ -7,16 +7,16 @@
 ```console
 #Make your home directory navigable
 chmod a+xr ~/
-mkdir ~/R_2.0.3
-chmod a+x ~/R_2.0.3/
-cd ~/R_2.0.3
+mkdir ~/R_2.1.1
+chmod a+x ~/R_2.1.1/
+cd ~/R_2.1.1
 # Note: "deactivate" does not actually work in scripts.
 deactivate
 rm -r venv_cerebras_pt
 /software/cerebras/python3.8/bin/python3.8 -m venv venv_cerebras_pt
 source venv_cerebras_pt/bin/activate
 pip install --upgrade pip
-pip install cerebras_pytorch==2.0.2
+pip install cerebras_pytorch==2.1.1
 ```
 
 <!--- No longer any TensorFlow wheel
@@ -28,7 +28,7 @@ pip install cerebras_pytorch==2.0.2
 To activate a virtual environments
 
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
 ```
 
 To deactivate a virtual environment,

diff --git a/docs/ai-testbed/cerebras/example-programs.md b/docs/ai-testbed/cerebras/example-programs.md
@@ -4,37 +4,39 @@
 Make a working directory and a local copy of the Cerebras **modelzoo** and **anl_shared** repository, if not previously done, as follows.
 
 ```bash
-mkdir ~/R_2.0.3
-cd ~/R_2.0.3
+mkdir ~/R_2.1.1
+cd ~/R_2.1.1
 git clone https://github.com/Cerebras/modelzoo.git
 cd modelzoo
 git tag
-git checkout Release_2.0.3
+git checkout Release_2.1.1
 ```
 <!---
-cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_2.0.3/anl_shared
+cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_2.1.1/anl_shared
 --->
 
+<!---
 ## UNet
 
 An implementation of this: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/pdf/1505.04597.pdf), Ronneberger et.  al 2015<br>
 To run Unet with the <a href="https://www.kaggle.com/c/severstal-steel-defect-detection">Severstal: Steel Defect Detection</a> kaggle dataset, using a pre-downloaded copy of the dataset:<br>
 First, source a Cerebras PyTorch virtual environment and make sure that requirements are installed.
 
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
-pip install -r ~/R_2.0.3/modelzoo/requirements.txt
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
+pip install -r ~/R_2.1.1/modelzoo/requirements.txt
 ```
 
 Then
 
 ```console
-cd ~/R_2.0.3/modelzoo/modelzoo/vision/pytorch/unet
+cd ~/R_2.1.1/modelzoo/modelzoo/vision/pytorch/unet
 cp /software/cerebras/dataset/severstal-steel-defect-detection/params_severstal_binary_rawds.yaml configs/params_severstal_binary_rawds.yaml
 export MODEL_DIR=model_dir_unet
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
-python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log 
+python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log 
 ```
+--->
 
 <!--- Appears to not have been ported to 1.7.1
 ## BraggNN
@@ -47,7 +49,7 @@ The BraggNN model has two versions:<br>
 
 ```console
 TODO
-cd ~/R_2.0.3/anl_shared/braggnn/tf
+cd ~/R_2.1.1/anl_shared/braggnn/tf
 # This yaml has a correct path to a BraggNN dataset
 cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml
 export MODEL_DIR=model_dir_braggnn
@@ -67,20 +69,20 @@ source /software/cerebras/venvs/venv_cerebras_pt/bin/activate
 # or your personal venv
 --->
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
-pip install -r ~/R_2.0.3/modelzoo/requirements.txt
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
+pip install -r ~/R_2.1.1/modelzoo/requirements.txt
 ```
 
 Then
 
 ```console
-cd ~/R_2.0.3/modelzoo/modelzoo/transformers/pytorch/bert
+cd ~/R_2.1.1/modelzoo/modelzoo/transformers/pytorch/bert
 cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
 export MODEL_DIR=model_dir_bert_large_pytorch
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
-python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
 ```
-Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.0.3/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`. 
+Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.1.1/modelzoo/modelzoo/transformers/vocab/google_research_uncased_L-12_H-768_A-12.txt`. 
 
 The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown.
 
@@ -102,7 +104,7 @@ The last parts of the output should resemble the following, with messages about
 2023-11-29 20:13:25,691 INFO:   Training completed successfully!
 2023-11-29 20:13:25,691 INFO:   Processed 1024000 sample(s) in 336.373620536 seconds.
 ```
-
+<!---
 ## GPT-J PyTorch
 
 GPT-J [[github]](https://github.com/kingoflolz/mesh-transformer-jax) is an auto-regressive language model created by [EleutherAI](https://www.eleuther.ai/).
@@ -111,18 +113,18 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 2 CS2s.
 First, source a Cerebras PyTorch virtual environment and make sure that the requirements are installed:
 
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
-pip install -r ~/R_2.0.3/modelzoo/requirements.txt
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
+pip install -r ~/R_2.1.1/modelzoo/requirements.txt
 ```
 
 Then
 
 ```console
-cd ~/R_2.0.3/modelzoo/modelzoo/transformers/pytorch/gptj
+cd ~/R_2.1.1/modelzoo/modelzoo/transformers/pytorch/gptj
 cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
 export MODEL_DIR=model_dir_gptj
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
-python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
 ```
 
 The last parts of the output should resemble the following:
@@ -137,3 +139,4 @@ The last parts of the output should resemble the following:
 2023-11-29 21:14:30,142 INFO:   Training completed successfully!
 2023-11-29 21:14:30,142 INFO:   Processed 24000 sample(s) in 910.883781998 seconds.
 ```
+--->
diff --git a/docs/ai-testbed/cerebras/running-a-model-or-program.md b/docs/ai-testbed/cerebras/running-a-model-or-program.md
@@ -25,29 +25,29 @@ Follow these instructions to compile and train the `fc_mnist` PyTorch sample. Th
 
 First, make a virtual environment for Cerebras for PyTorch.
 See [Customizing Environments](./customizing-environment.md) for the procedures for making PyTorch virtual environments for Cerebras.
-If an environment is made in ```~/R_2.0.3/```, it they would be activated as follows:
+If an environment is made in ```~/R_2.1.1/```, it would be activated as follows:
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
 ```
 
 ### Clone the Cerebras modelzoo
 
 ```console
-mkdir ~/R_2.0.3
-cd ~/R_2.0.3
+mkdir ~/R_2.1.1
+cd ~/R_2.1.1
 git clone https://github.com/Cerebras/modelzoo.git
 cd modelzoo
 git tag
-git checkout Release_2.0.3
+git checkout Release_2.1.1
 ```
 ## Running a Pytorch sample
 
 ### Activate your PyTorch virtual environment, install modelzoo requirements, and change to the working directory
 
 ```console
-source ~/R_2.0.3/venv_cerebras_pt/bin/activate
-pip install -r ~/R_2.0.3/modelzoo/requirements.txt
-cd ~/R_2.0.3/modelzoo/modelzoo/fc_mnist/pytorch
+source ~/R_2.1.1/venv_cerebras_pt/bin/activate
+pip install -r ~/R_2.1.1/modelzoo/requirements.txt
+cd ~/R_2.1.1/modelzoo/modelzoo/fc_mnist/pytorch
 ```
 
 Next, edit configs/params.yaml, making the following changes:
@@ -76,7 +76,7 @@ To run the sample:
 export MODEL_DIR=model_dir
 # deletion of the model_dir is only needed if sample has been previously run
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
-python run.py CSX --job_labels name=pt_smoketest --params configs/params.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.0.3/modelzoo --compile_dir /$(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=pt_smoketest --params configs/params.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.1.1/modelzoo --compile_dir /$(whoami) |& tee mytest.log
 ```
 
 A successful fc_mnist PyTorch training run should finish with output resembling the following: