Running on AWS GPU instances #3

rsignell-usgs opened this issue Jun 21, 2021 · 0 comments

I got this working today on AWS using No Tears Cluster and wanted to document it before I forget.

After creating the cluster (I chose us-east-2 since nobody uses it and I can get spot instances all the time), I logged in, installed Miniconda following the IOOS Python instructions (but skipping the creation of the IOOS environment), installed mamba into the base environment, and then ran (the whole sequence is sketched below, after the environment file):

mamba env create -f tensorflow-gpu.yml

using this slightly modified version of the environment:

name: tensorflow-gpu
channels:
  - defaults
dependencies:
  - python
  - numba
  - numpy
  - nodejs
  - scipy
  - matplotlib
  - imageio
  - h5py
  - pandas
  - pip
  - mkl-service
  - scikit-image
  - scikit-learn
  - requests
  - tensorflow-gpu
  - ipykernel

The key was using only the defaults channel. If you include conda-forge in the channel list, it pulls in the latest cudatoolkit package, which doesn't work with the tensorflow-gpu from defaults. (tensorflow-gpu is not currently available on conda-forge.)
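
For reference, the setup steps boil down to roughly the following (a sketch, not the authoritative steps; the Miniconda installer URL and install path here are the generic ones, so defer to the IOOS instructions you actually follow):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
conda install -n base -c conda-forge mamba
mamba env create -f tensorflow-gpu.yml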

I activated the environment (conda activate tensorflow-gpu) and then followed the instructions for setting up Pangeo on HPC.
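
One piece of that Pangeo-on-HPC setup is making the environment show up as a kernel in JupyterLab; a minimal sketch of just that step (the kernel and display names here are my own choice, not from those instructions):

conda activate tensorflow-gpu
python -m ipykernel install --user --name tensorflow-gpu --display-name "Python (tensorflow-gpu)"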

I then started an interactive session on the gpu partition, which spins up an instance to satisfy the request and then opens a terminal there:

srun --nodes=1 --partition=gpu --ntasks-per-node=1 --time=01:00:00 --pty bash -i
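
Before launching, something like this confirms the gpu partition exists and shows its node states (plain Slurm, nothing specific to this cluster):

sinfo -p gpu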

Then I ran a start_jupyter script (~/bin/start_jupyter), which looks like this:

#!/bin/bash
# Pick a random port, print the ssh tunnel command to run locally, then launch JupyterLab.
cd /shared
JPORT=$(shuf -i 8400-9400 -n 1)
echo ""
echo ""
echo "Step 1: Wait until this script says the Jupyter server"
echo "        has started. "
echo ""
echo "Step 2: Copy this ssh command into a terminal on your"
echo "        local computer:"
echo ""
echo "        ssh -N -i $HOME/.ssh/AWS-HPC-Ohio -L 8889:`hostname`:$JPORT $USER@ec2-3-129-67-246.us-east-2.compute.amazonaws.com"
echo ""
echo "Step 3: Browse to https://localhost:8889 on your local computer"
echo ""
echo ""
sleep 2
jupyter lab --no-browser --ip=`hostname` --port=$JPORT
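
Assuming it is saved at ~/bin/start_jupyter as above, the script needs to be made executable once before it can be run on the compute node:

chmod +x ~/bin/start_jupyter
~/bin/start_jupyter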

Then on my local Windows machine, I opened a Git Bash terminal and pasted in the ssh forwarding command echoed above.

Then I opened up localhost:8889 in my local browser and ran the notebook on the remote GPU instance!
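
A quick sanity check that TensorFlow actually sees the GPU is worth doing first (my addition, not part of the original steps, and assuming the TF 2.x API from the defaults package):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"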

Note: I discovered which version of cudatoolkit to specify by running:

$ nvidia-smi | grep NVIDIA-SMI    
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |

which told me I should not specify any cudatoolkit newer than 11.0.
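
If the solver ever does pull in a toolkit that is too new, the version can also be pinned directly in the yml, e.g. (a sketch matching the 11.0 driver above; I didn't need this with the defaults-only channel list):

  - cudatoolkit=11.0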

Here's the proof that it worked: https://nbviewer.jupyter.org/gist/rsignell-usgs/1e1a7f3ae3483725dd8f78f4d02c023a

cc: @csherwood-usgs
