Running on AWS GPU instances #3

rsignell-usgs opened this issue Jun 21, 2021 · 0 comments

I got this working today on AWS using No Tears Cluster and wanted to document it before I forget.

After creating the cluster (I chose us-east-2 since nobody uses it and I can get spot instances all the time), I logged in, installed Miniconda following the IOOS Python instructions (but skipping the creation of the IOOS environment), installed mamba into the base environment, and then ran (the whole sequence is sketched below, after the environment file):

mamba env create -f tensorflow-gpu.yml

using this slightly modified version of the environment:

name: tensorflow-gpu
channels:
  - defaults
dependencies:
  - python
  - numba
  - numpy
  - nodejs
  - scipy
  - matplotlib
  - imageio
  - h5py
  - pandas
  - pip
  - mkl-service
  - scikit-image
  - scikit-learn
  - requests
  - tensorflow-gpu
  - ipykernel

The key was using only the defaults channel. If you include conda-forge in the channel list, it pulls in the latest cudatoolkit package, which doesn't work with the tensorflow-gpu from defaults. (tensorflow-gpu is not currently available on conda-forge.)
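
For reference, the setup steps boil down to roughly the following (a sketch, not the authoritative steps; the Miniconda installer URL and install path here are the generic ones, so defer to the IOOS instructions you actually follow):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
conda install -n base -c conda-forge mamba
mamba env create -f tensorflow-gpu.yml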

I activated the environment (conda activate tensorflow-gpu) and then followed the instructions for setting up Pangeo on HPC.
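
One piece of that Pangeo-on-HPC setup is making the environment show up as a kernel in JupyterLab; a minimal sketch of just that step (the kernel and display names here are my own choice, not from those instructions):

conda activate tensorflow-gpu
python -m ipykernel install --user --name tensorflow-gpu --display-name "Python (tensorflow-gpu)"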

I then started an interactive session on the gpu partition, which spins up an instance to satisfy the request and then opens a terminal there:

srun --nodes=1 --partition=gpu --ntasks-per-node=1 --time=01:00:00 --pty bash -i
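
Before launching, something like this confirms the gpu partition exists and shows its node states (plain Slurm, nothing specific to this cluster):

sinfo -p gpu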

Then I ran a start_jupyter script (~/bin/start_jupyter), which looks like this:

#!/bin/bash
# Pick a random port, print the ssh tunnel command to run locally, then launch JupyterLab.
cd /shared
JPORT=$(shuf -i 8400-9400 -n 1)
echo ""
echo ""
echo "Step 1: Wait until this script says the Jupyter server"
echo "        has started. "
echo ""
echo "Step 2: Copy this ssh command into a terminal on your"
echo "        local computer:"
echo ""
echo "        ssh -N -i $HOME/.ssh/AWS-HPC-Ohio -L 8889:`hostname`:$JPORT $USER@ec2-3-129-67-246.us-east-2.compute.amazonaws.com"
echo ""
echo "Step 3: Browse to https://localhost:8889 on your local computer"
echo ""
echo ""
sleep 2
jupyter lab --no-browser --ip=`hostname` --port=$JPORT
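
Assuming it is saved at ~/bin/start_jupyter as above, the script needs to be made executable once before it can be run on the compute node:

chmod +x ~/bin/start_jupyter
~/bin/start_jupyter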

Then on my local Windows machine, I opened a Git Bash terminal and pasted in the ssh forwarding command echoed above.

Then I opened up localhost:8889 in my local browser and ran the notebook on the remote GPU instance!
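
A quick sanity check that TensorFlow actually sees the GPU is worth doing first (my addition, not part of the original steps, and assuming the TF 2.x API from the defaults package):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"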

Note: I discovered which version of cudatoolkit to specify by running:

$ nvidia-smi | grep NVIDIA-SMI    
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |

which told me I should not specify any cudatoolkit newer than 11.0.
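
If the solver ever does pull in a toolkit that is too new, the version can also be pinned directly in the yml, e.g. (a sketch matching the 11.0 driver above; I didn't need this with the defaults-only channel list):

  - cudatoolkit=11.0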

Here's the proof that it worked: https://nbviewer.jupyter.org/gist/rsignell-usgs/1e1a7f3ae3483725dd8f78f4d02c023a

cc: @csherwood-usgs
