ICCluster at CVLab

If you find a mistake, something is not working, you know a better way to do it, or you need a new image to be built, please let me know or open an issue here. - Kris

Quick start with RunAI

Install RunAI CLI

Things to download and put in $PATH:

Make sure the binaries have permission to execute (e.g. chmod +x some/place/runai). More on the CLI installation: https://docs.run.ai/Administrator/Researcher-Setup/cli-install/

Login

Quick-start scripts

We have scripts for launching jobs with sensible defaults which should serve you for most use cases.

Script setup

  • Download scripts: runai_one.sh, runai_interactive.sh

  • Edit the scripts to fill in CLUSTER_USER, CLUSTER_USER_ID values for your EPFL cluster account, and MY_WORK_DIR if you want to change the directory where the job runs.

Batch job

Submit jobs running a command with runai_one.sh. These jobs have training priority.

  • bash runai_one.sh job_name num_gpu "command"

  • bash runai_one.sh name-hello-1 1 "python hello.py"
    creates a job named name-hello-1, uses 1 GPU, enters MY_WORK_DIR directory and runs python hello.py

  • bash runai_one.sh name-hello-2 0.5 "python hello_half.py"
    creates a job named name-hello-2, receives half of a GPU's memory (2 such jobs can fit on one GPU!), enters MY_WORK_DIR directory and runs python hello_half.py
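
To queue several runs with different settings, the call pattern above can be wrapped in a loop. The following is a minimal sketch, not part of the lab scripts: submit_sweep, train.py and its --lr flag are hypothetical placeholders, and the job-name cleanup assumes that names should contain only lowercase letters, digits and dashes. By default it only prints the commands (DRY_RUN=1), so nothing is submitted until you set DRY_RUN=0.

```shell
# Print (or run) one runai_one.sh submission per learning rate.
# train.py and its --lr flag are hypothetical placeholders.
DRY_RUN="${DRY_RUN:-1}"

submit_sweep() {
    prefix="$1"; shift
    for lr in "$@"; do
        # Assumption: job names should avoid dots, so 0.1 becomes 0-1.
        suffix=$(printf '%s' "$lr" | tr '.' '-')
        if [ "$DRY_RUN" = "1" ]; then
            echo "bash runai_one.sh ${prefix}-lr${suffix} 1 \"python train.py --lr ${lr}\""
        else
            bash runai_one.sh "${prefix}-lr${suffix}" 1 "python train.py --lr ${lr}"
        fi
    done
}
```

For example, submit_sweep name-sweep 0.1 0.01 prints two submission commands, one per learning rate.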

Interactive session

Submit an interactive job with bash runai_interactive.sh. The job is named yourname-inter, has interactive priority, uses 0.5 GPU (customizable), starts a jupyter server on port 8888 with the default password hello, and runs for 8 hours.

  • Connect to the jupyter server: kubectl port-forward yourname-inter-0-0 8888:8888, open localhost:8888, default password is hello.
  • Connect in the console: runai bash yourname-inter.
  • Once the interactive job has finished, delete it to make starting a new one possible: runai delete yourname-inter

Remote work with vscode

There is a separate tutorial on setting vscode up to work directly on the running node, allowing for easy (and GPU-accelerated) execution and debugging.

Detailed job management

runai_project="cvlab-$CLUSTER_USER" # per-user runai projects now

runai submit $arg_job_name \
	-i $MY_IMAGE \
	--gpu $arg_gpu \
	--pvc runai-$runai_project-cvlabdata1:/cvlabdata1 \
	--pvc runai-$runai_project-cvlabdata2:/cvlabdata2 \
	--pvc runai-$runai_project-cvlabsrc1:/cvlabsrc1 \
	--large-shm \
	-e CLUSTER_USER=$CLUSTER_USER \
	-e CLUSTER_USER_ID=$CLUSTER_USER_ID \
	-e CLUSTER_GROUP_NAME=$CLUSTER_GROUP_NAME \
	-e CLUSTER_GROUP_ID=$CLUSTER_GROUP_ID \
	-e TORCH_HOME="/cvlabsrc1/cvlab/pytorch_model_zoo" \
	--command -- /opt/lab/setup_and_run_command.sh "cd $MY_WORK_DIR && $arg_cmd"
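
The snippet above relies on several shell variables being set beforehand. Below is a sketch of such a preamble, using placeholder values taken from the examples elsewhere in this document. The actual scripts may define these differently; parse_job_args and the MY_WORK_DIR value are hypothetical.

```shell
# Placeholder account values; replace them with the output of `id` on the cluster.
CLUSTER_USER="username"
CLUSTER_USER_ID="123456"
CLUSTER_GROUP_NAME="CVLAB-unit"
CLUSTER_GROUP_ID="11166"
MY_IMAGE="ic-registry.epfl.ch/cvlab/lis/lab-python-ml:cuda10"
# Hypothetical working directory; point it at your own folder.
MY_WORK_DIR="/cvlabdata2/home/$CLUSTER_USER"

# Read the three positional arguments and fail early with a readable
# message if any of them is missing.
parse_job_args() {
    if [ "$#" -ne 3 ]; then
        echo "usage: runai_one.sh job_name num_gpu \"command\"" >&2
        return 1
    fi
    arg_job_name="$1"
    arg_gpu="$2"
    arg_cmd="$3"
}
```

Inside a script, calling parse_job_args "$@" || exit 1 before runai submit would stop the submission when an argument is forgotten.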

Choice of docker images: The mechanism which sets up the user/group will not work on docker images built from scratch, because it uses these setup scripts. The details of our images are in the images section of this repository. You are welcome to use these images or build upon them. For direct use I recommend ic-registry.epfl.ch/cvlab/lis/lab-python-ml:cuda10 as it has fairly modern versions of various scientific libraries.

Volume mounts: The default volume mounts in the script are for CVLAB (cvlabdata volumes). Please change them if you are in a different lab.

Training vs interactive: By default jobs run in training mode, which means they can use GPUs beyond the lab's quota of 28 but can be stopped and restarted (so it's worth saving checkpoints). Jobs can be made interactive (non-preemptible) with the --interactive option of runai submit, but they are stopped after 12 hours and only a limited number of them is allowed in the lab, so please do not create too many simultaneously.
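
Because a training job can be stopped and restarted, the job command should be able to continue from the latest saved state. The following is a minimal sketch, assuming your training script writes checkpoints with zero-padded, sortable names such as ckpt_0007.pt into one directory. The file naming, train.py and the --resume flag are hypothetical and not part of the lab scripts.

```shell
# Print the newest checkpoint in a directory, or nothing if there is none.
# Assumes sortable names, e.g. ckpt_0001.pt, ckpt_0002.pt (hypothetical convention).
latest_checkpoint() {
    ckpt_dir="$1"
    for f in "$ckpt_dir"/ckpt_*.pt; do
        # Skip the unexpanded pattern when no checkpoint exists yet.
        [ -e "$f" ] && echo "$f"
    done | sort | tail -n 1
}

# Build the training command, adding --resume only when a checkpoint exists.
# train.py and --resume are placeholders for your own script and flag.
training_command() {
    ckpt=$(latest_checkpoint "$1")
    if [ -n "$ckpt" ]; then
        echo "python train.py --resume $ckpt"
    else
        echo "python train.py"
    fi
}
```

This logic is best placed in a script that the job itself runs, so that the check happens every time the container is restarted rather than once at submission time.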

Manage and connect

  • List jobs in the lab: runai list jobs

  • Find out the status of your job: runai describe job jobname

  • Stop running jobs with runai delete job jobname. If you want to submit another job with the same name, you first need to delete the existing one, which occupies the name.

  • View logs: runai logs jobname. Add --tail 64 to see the 64 latest lines (or another number).

  • Run an interactive console inside the container: runai bash jobname.

  • Forward ports between the container and your machine, for example for jupyter: kubectl port-forward jobname-0-0 8888:8888

Asking the admins for help

The cluster machines sometimes get stuck and need to be restarted, or there are bugs in RunAI. In these cases, we need to ask the ICIT admins for help. To localize the problem, they need good diagnostic information from you.

The detailed procedure can be found here. Below is a copy of this procedure, so that you can view it outside of the EPFL network:

To open a ticket, please send an email to [email protected].

  • Choose an explicit subject
  • qualify your ticket by providing all the information useful to resolve your issue
  • attach your yaml file or the runai command used to start your job
  • attach job/pod's log information (replace <lab> by your lab name)
    • find your job/pod:
    $ runai list job -p <lab>
    $ kubectl get pods -n runai-<lab>
    
    • get job/pod's description
    $ runai describe job <job name> -p <lab>
    $ kubectl describe pod <pod name> -n runai-<lab>
    
    • get job/pod's log
    $ runai logs <pod name> -p <lab>
    $ kubectl logs <pod name> -n runai-<lab>
    
  • provide any other log messages you have
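
The commands in this procedure can be gathered into one helper that saves each output to a file, ready to attach to the ticket. This is only a sketch: the function names and output file names are hypothetical, and runai logs is given the job name, as in the Manage and connect section. Set DRY_RUN=1 to write the command lines into the files instead of running them.

```shell
# Run a command and store its output (stdout and stderr) in a file.
# With DRY_RUN=1 the command line is written to the file instead of being run.
save_output() {
    target="$1"; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*" > "$target"
    else
        "$@" > "$target" 2>&1 || true
    fi
}

# Usage: collect_diagnostics <lab> <job name> <pod name> [output dir]
collect_diagnostics() {
    lab="$1"; job="$2"; pod="$3"; out="${4:-diagnostics}"
    mkdir -p "$out"
    save_output "$out/runai_list.txt"       runai list job -p "$lab"
    save_output "$out/kubectl_pods.txt"     kubectl get pods -n "runai-$lab"
    save_output "$out/runai_describe.txt"   runai describe job "$job" -p "$lab"
    save_output "$out/kubectl_describe.txt" kubectl describe pod "$pod" -n "runai-$lab"
    save_output "$out/runai_logs.txt"       runai logs "$job" -p "$lab"
    save_output "$out/kubectl_logs.txt"     kubectl logs "$pod" -n "runai-$lab"
}
```

For example, collect_diagnostics <lab> jobname jobname-0-0 creates a diagnostics directory with six text files.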

Overview

Docker containers are processes running on a docker host (that is, our server). They use the same operating system as the host, but have their own internal file system and do not see the host's file system.

Images are snapshots of that internal file system. For example, we install our libraries in a container and take a snapshot so that we can start new containers from the same base. Images can be made by saving a given container's file system, but usually are specified declaratively with Dockerfiles.

Kubernetes is a system that organizes running a big number of docker containers on a multi-machine cluster. The rationale is that Kubernetes will allocate resources when we need to run a job and release them later, leading to a more efficient usage than when machines are assigned to people - we do not pay for the resources when the jobs are not running.

Pre-built images

I made some base images that should be useful to everyone. It should be easy to start using those, without having to build custom images. The user account setup is done through environment variables, so you do not have to place it in your Dockerfile.

ic-registry.epfl.ch/cvlab/lis/lab-python-ml:cuda11 contains CUDA, PyTorch, Tensorflow, OpenCV, GluonCV, Detectron2, PyTorch3D as well as other commonly used packages. If you need more, you can extend this and build your own image on top (Dockerfile FROM) or let me know that something needs adding.

ic-registry.epfl.ch/cvlab/lis/lab-pytorch:cuda11 - smaller image without TF or Gluon.

ic-registry.epfl.ch/cvlab/lis/lab-base:cpu is the base with just user account setup for cvlabdata mounting, the :cuda10 version additionally has CUDA installed.

More about images here.

The GPUs we have at the cluster work faster with half-precision training.

External storage

By default the container only has access to its internal file system. To read or save some data, we will mount the cvlabdata drives.

This is achieved by adding this to your pod configuration (pod is the top-level object):

  volumes:
    - name: cvlabsrc1
      persistentVolumeClaim:
        claimName: runai-cvlab-yourname-cvlabsrc1
    - name: cvlabdata1
      persistentVolumeClaim:
        claimName: runai-cvlab-yourname-cvlabdata1
    - name: cvlabdata2
      persistentVolumeClaim:
        claimName: runai-cvlab-yourname-cvlabdata2

and this to each of your containers:

      volumeMounts:
        - mountPath: /cvlabsrc1
          name: cvlabsrc1
        - mountPath: /cvlabdata1
          name: cvlabdata1
        - mountPath: /cvlabdata2
          name: cvlabdata2

CVLabData write permissions

To have write permissions to cvlabdata, we need to present our user IDs from the cluster. Run the id command on iccluster; you should get something like this:

uid=123456(youruser) gid=11166(CVLAB-unit) groups=....

Copy the number from uid=... and put it into the pod configuration file:

      env:
      - name: CLUSTER_USER
        value: "username" # set this
      - name: CLUSTER_USER_ID
        value: "123456" # set this
      - name: CLUSTER_GROUP_NAME
        value: "CVLAB-unit"
      - name: CLUSTER_GROUP_ID
        value: "11166"
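
Instead of copying the numbers by hand, the block can be generated from the current account. The sketch below uses the standard id options (-un, -u, -gn, -g). The function name print_cluster_env is hypothetical, and it assumes you run it on the cluster machine where id reports your cluster IDs.

```shell
# Print the env block for the pod configuration using the current account.
# Run this on the cluster machine, where `id` reports your cluster IDs.
print_cluster_env() {
    cat <<EOF
      env:
      - name: CLUSTER_USER
        value: "$(id -un)"
      - name: CLUSTER_USER_ID
        value: "$(id -u)"
      - name: CLUSTER_GROUP_NAME
        value: "$(id -gn)"
      - name: CLUSTER_GROUP_ID
        value: "$(id -g)"
EOF
}
```

Check the printed group against the gid=... part of the id output before pasting it into the configuration file.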

In my base containers, these variables are used to set up the user account with the following script when the container starts up.

The images which have this feature so far are:

  • ic-registry.epfl.ch/cvlab/lis/lab-base:cpu
  • ic-registry.epfl.ch/cvlab/lis/lab-pytorch-extra:py38src
  • ic-registry.epfl.ch/cvlab/lis/lab-python-ml:py38src
  • and anything built on top of those

Startup Command

The command field specifies the program to run when the container starts. When this command finishes, the container shuts down.

For example running a python program:

command: ["python", "some_program.py", "--option", "val"]

In the premade images with user setup:

# run a python job
command:
  - "/opt/lab/setup_and_run_command.sh"
  - "cd /cvlabdata2/home/lis/kubernetes_example && python job_example.py"
# start a jupyter server
command:
  - "/opt/lab/setup_and_run_command.sh"
  - "timeout 4h jupyter lab --ip=0.0.0.0 --no-browser --notebook-dir=/cvlabdata2/home/lis/kubernetes_example"
  # Timeout will ensure the pod closes after some time,
  # so we don't risk leaving it running forever.

You can run those examples in /cvlabdata2/home/lis/kubernetes_example, I will clear it out periodically.

Timeout

If a process does not finish by itself, I recommend limiting its lifetime with timeout. The following command will automatically shut down Jupyter after 4 hours:

timeout 4h jupyter lab --ip=0.0.0.0 --no-browser --notebook-dir=/cvlabdata2/home/lis/kubernetes_example
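
When the time limit is reached, timeout (GNU coreutils) stops the command and exits with status 124, which lets a wrapper script tell a time limit apart from a normal exit. Below is a small sketch; run_limited is a hypothetical helper name, and any long-running command can take the place of the server.

```shell
# Run a command under a time limit and report how it ended.
# Usage: run_limited <duration> <command...>
run_limited() {
    limit="$1"; shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout returns 124 when it had to stop the command.
        echo "stopped by timeout after $limit"
    else
        echo "finished with exit status $status"
    fi
    return "$status"
}
```

For example, run_limited 1s sleep 5 reports that the command was stopped by the timeout, while run_limited 5s true reports a normal exit.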

Connecting an interactive console to the container

Once a pod is running, we can connect to it and run commands inside:

kubectl exec -it pod_name -- /bin/bash

This will be executed as the root user, so switch to your user which can write on cvlab drives:

su youruser -c /bin/bash

This can be combined into a single convenient command:

kubectl exec -it pod_name -- bash -c "su youruser -c tmux"

Diagnosing problems

If the job is not running as intended, you can see its status:

kubectl describe pod/pod_name

and check for errors by viewing the output of your process:

kubectl logs pod_name

Running multiple experiments in one container

The GPUs in the Kubernetes cluster usually have 32GB of memory, so compared to the previous 12GB GPUs, they should be capable of running 2 or 3 experiments of usual size at once.

The script below shows a simple way to run several experiments at once. The commands will run in parallel, the container will finish when the last one finishes.

# my_job.sh
python task_1.py &
python task_2.py &
bash task_3.sh &
# the jobs will run in parallel
# the container will finish when the last one finishes
wait
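
A plain wait with no arguments returns status 0 even if one of the background tasks failed, so the container would appear to finish successfully. If you want a failure to be visible, one option is to wait for each process ID separately. Below is a minimal sketch; run_parallel is a hypothetical helper, not part of the lab images.

```shell
# Start every command in the background, then wait for each one
# and remember whether any of them failed.
run_parallel() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        # wait <pid> returns the exit status of that particular process
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

For example, run_parallel "python task_1.py" "python task_2.py" "bash task_3.sh" returns a non-zero status if any of the three tasks fails.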

Network communication - port forwarding

See the example pod configuration for jupyter. To connect to our container over the network, first we need to expose the ports in our container configuration:

ports:
- containerPort: 8888
  name: jupyter

Once the container with exposed ports is running, we will make a tunnel from our local computer's port to the container's port (Kubernetes port forwarding).

kubectl port-forward mypod local_port:container_port

For example for jupyter:

kubectl port-forward lis-example 8001:8888

Then we can open jupyter at localhost:8001. The password in the example config is hello.

To shut down jupyter (and the container with it) from the web interface:

  • JupyterLab: select File -> Quit from the menu in the top-left
  • Jupyter Notebook: press the Quit button in the top right

Jupyter will run forever if we do not close it. Therefore I recommend limiting it with timeout. The following command will automatically shut down Jupyter after 4 hours:

timeout 4h jupyter lab --ip=0.0.0.0 --no-browser --notebook-dir=/cvlabdata2/home/lis/kubernetes_example

Alternatively a load balancer can be used to make the container accessible through the network.