A set of common deep learning models that can be deployed on a Kubernetes cluster with GPU support.
- Create the `env-values.yaml` file:

```bash
cp env-values.example.yaml env-values.yaml
```

Now, open the `env-values.yaml` file and fill in the values.
- Connect to your k8s cluster. The exact commands depend on your cloud provider. E.g. for Azure:

```bash
# Set the cluster subscription
az account set --subscription <subscription-id>

# Download cluster credentials
az aks get-credentials --resource-group <resource-group> --name <vm-name>

# Set the namespace
kubectl config set-context --current --namespace=gpu
```
- Deploy the helm chart:

```bash
helm install gooey-gpu-1 chart -f chart/model-values.yaml -f env-values.yaml
```
- Check the status of the deployment:

```bash
kubectl get pods -n gpu
```
gooey-gpu includes a standard Dockerfile with common deep learning dependencies like CUDA, diffusers & transformers pre-installed. It also provides a small Python helper library that makes it easy to write celery tasks.
- If these dependencies are not enough, you can add more to the `Dockerfile` & `requirements.txt` file.
- Create a new file in the `common/` folder that imports the model and defines the load function.

```python
## common/my_model.py
from celeryconfig import setup_queues
from functools import lru_cache
import os


@lru_cache  # this will cache the model in memory and use it across calls
def load_model(model_id: str):
    ...


setup_queues(
    model_ids=os.environ["MY_MODEL_IDS"].split(),  # get the model ids from the env
    load_fn=load_model,  # this tells the celery worker to load the model when starting
)
```
To load custom models, you can use the provided cache directory. The helm chart includes an NFS provisioner that mounts a shared directory across all the pods. You can use this directory to store the models.
```python
## common/my_model.py
import torch
import os
import gooey_gpu
from functools import lru_cache


@lru_cache
def load_model(model_id: str):
    model_path = os.path.join(gooey_gpu.CHECKPOINTS_DIR, model_id)
    if not os.path.exists(model_path):
        ...  # download the model from huggingface or any other source
    return torch.load(model_path).to(gooey_gpu.DEVICE_ID)
```
You can also `kubectl exec` into a running pod and manually copy the model files to the shared directory:

```bash
kubectl exec -it <pod-name> -n gpu -- bash
cd /root/.cache/gooey-gpu/checkpoints
... # copy the model files here
```
- Define the model inference params and code:

```python
## common/my_model.py
import gooey_gpu
from celeryconfig import app, setup_queues
from pydantic import BaseModel


class MyModelPipeline(BaseModel):
    model_id: str
    ...


class MyModelInputs(BaseModel):
    text: str
    ...


class MyModelOutput(BaseModel):
    image: str
    ...


@app.task(name="my_model_task")
@gooey_gpu.endpoint
def my_model_task(
    pipeline: MyModelPipeline,
    inputs: MyModelInputs,
) -> MyModelOutput:
    model = load_model(pipeline.model_id)
    ...  # write the inference code here
    return MyModelOutput(...)
```
- Install rabbitmq, redis & docker on the machine with GPUs.
- Set the Hugging Face Hub Token:

```bash
echo "export HUGGING_FACE_HUB_TOKEN=hf_XXXX" >> ~/.bashrc
```
- Add your new model ids as env vars in `scripts/run-dev.sh`:

```bash
## scripts/run-dev.sh
docker run \
  -e MY_MODEL_IDS="
    model_id_1
    model_id_2
  " \
  ...
```
- Run the development script:

```bash
./scripts/run-dev.sh common common.my_model
```
- Test the model by sending a request to the celery worker.

If you are using a VM, port-forward the redis & rabbitmq ports from the VM to your local machine. E.g. for Azure:

```bash
az ssh config --name <vm-name> --resource-group <resource-group> --file ~/.ssh/config.d/<vm-name> --overwrite
ssh <vm-name> -vN -L 6374:localhost:6379 -L 5674:localhost:5672
```

Note how we use ports `6374` and `5674`. This avoids potential conflicts with a local redis and rabbitmq.

```python
from celery import Celery

app = Celery()
app.conf.broker_url = "amqp://localhost:<port>"
app.conf.result_backend = "redis://localhost:<port>"
app.conf.result_extended = True

result = app.send_task(
    "my_model_task",
    kwargs=dict(
        pipeline=dict(model_id="model_id_1"),
        inputs=dict(text="Hello, World!"),
    ),
    queue="gooey-gpu/model_id_1",  # by default the queue name is gooey-gpu/<model_id>
)
print(result.get())  # { "image": "..." }
```
- During this, try to record the GPU memory usage using `nvitop` (or `nvidia-smi`). This will come in handy when defining the resource limits in the helm chart. Once you have this number, you need to convert it to the equivalent CPU memory limit.

To do this, you can use the following formula:

`cpu_memory_limit = (cpu_memory_capacity / gpu_memory_capacity) * gpu_memory_limit`

E.g. for an Azure Standard_NC24ads_A100_v4 with 216 GiB CPU memory and 80 GiB GPU memory, and a diffusion model with a max GPU memory usage of 7 GiB, the CPU memory limit would be `(216 / 80) * 7 ≈ 20Gi`.

This helps us put multiple models on the same GPU and avoid CUDA OOM errors.
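The same arithmetic as a quick Python helper (a throwaway sketch; the function name and numbers below are just the worked example above, not part of the repo):

```python
# rough conversion from a measured GPU memory limit to an equivalent CPU memory limit
def cpu_memory_limit_gib(cpu_capacity_gib: float, gpu_capacity_gib: float, gpu_limit_gib: float) -> float:
    return (cpu_capacity_gib / gpu_capacity_gib) * gpu_limit_gib

# Standard_NC24ads_A100_v4: 216 GiB RAM, 80 GiB VRAM; model peaks at ~7 GiB VRAM
print(cpu_memory_limit_gib(216, 80, 7))  # ~18.9 -> round up and set memory: "20Gi"
```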
- Once you are confident that the model is working as expected, upload the docker image to a container registry:

```bash
docker tag gooey-gpu-common <registry>/<image>:<tag>
docker push <registry>/<image>:<tag>
```

You might need to login to your container registry before pushing the image. E.g. for Azure:

```bash
az acr login --name <registry>
```
- Update the `model-values.yaml` file with the new image (or create a new file).
- Under `rabbitmqAutoscaling`, add the new env var name:

```yaml
rabbitmqAutoscaling:
  queueNameVars:
    - MY_MODEL_IDS
```
- Under `deployments`, add your new model:

```yaml
deployments:
  - name: "common-my-model"
    image: "<registry>/<image>:<tag>"
    limits:
      memory: "20Gi"
      cpu: 1
    env:
      IMPORTS: |-
        my_model
      MY_MODEL_IDS: |-
        model_id_1
        model_id_2
```
- Deploy the helm chart. First, check the diff to make sure the new model is being added:

```bash
helm diff upgrade gooey-gpu-1 chart -f chart/model-values.yaml -f env-values.yaml
```

Then deploy the chart:

```bash
helm upgrade gooey-gpu-1 chart -f chart/model-values.yaml -f env-values.yaml
```
We have a `retro/` folder that contains dependencies for older projects using CUDA 11.6 & PyTorch v1.

The recommended way to turn a set of research scripts like Wav2Lip into a deployable container is:
- Add the project as a submodule in the `retro/` folder:

```bash
cd retro/
git submodule add <project-url>
```
- Create a python module in the `retro/` folder that imports the project's modules.

If you're lucky and the project has an installable package, you can just add the package to the `requirements.txt` file.

Otherwise, you can create a python module that imports the project's modules:

```python
## retro/my_model.py
import os
import sys

sys.path.append(os.path.join(os.path.dirname(__file__), "<project-dir>"))
```
- Copy the code from the project's inference script, and make a celery task out of it.

If you're lucky, the project has separate load() and predict() functions. In that case you can just call the load() fn from `setup_queues()`, and the predict() fn from the celery task (see the sketch below).

Otherwise, you will have to pick apart the inference script and write separate functions for loading the model and running inference.
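Here's a minimal sketch of that pattern. It assumes a hypothetical vendored project exposing `load()` and `predict()` functions; the module path, env var name and pydantic models are placeholders that mirror the common-model example above, not actual files in this repo:

```python
## retro/my_model.py  (hypothetical sketch)
import os
from functools import lru_cache

import gooey_gpu
from celeryconfig import app, setup_queues
from pydantic import BaseModel

# placeholder: the vendored project's own code, assumed to expose load() & predict()
from my_project import inference


class MyRetroPipeline(BaseModel):
    model_id: str


class MyRetroInputs(BaseModel):
    text: str


class MyRetroOutput(BaseModel):
    image: str


@lru_cache  # keep the loaded model in memory across calls
def load_model(model_id: str):
    return inference.load(model_id)


@app.task(name="my_retro_task")
@gooey_gpu.endpoint
def my_retro_task(pipeline: MyRetroPipeline, inputs: MyRetroInputs) -> MyRetroOutput:
    model = load_model(pipeline.model_id)
    return MyRetroOutput(image=inference.predict(model, inputs.text))


setup_queues(
    model_ids=os.environ["MY_RETRO_MODEL_IDS"].split(),
    load_fn=load_model,
)
```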
- Once you have these written, you can follow the same steps as the common models to deploy the model.
Gitleaks runs automatically as a pre-commit hook (see `pre-commit-config.yaml` for details) to prevent commits with secrets in the first place. To test this without committing, run `pre-commit` from the terminal. To skip this check, use `SKIP=gitleaks git commit -m "message"` to commit changes. Preferably, label false positives with the `#gitleaks:allow` comment instead of skipping the check.
Gitleaks will also run in the CI pipeline as a GitHub action on push and pull request (it can also be manually triggered from the Actions tab on GitHub). To update the baseline of ignored secrets, run `python ./scripts/create_gitleaks_baseline.py` from the venv and commit the changes to `.gitleaksignore`.