You can adapt any example to use on-premises machines and NVIDIA NIM microservices. By completing the additional prerequisites required to access the containers and use GPUs with Docker, you can run the examples with local GPUs and locally deployed microservices instead of NVIDIA API Catalog endpoints.
- You have an active subscription to an NVIDIA AI Enterprise product, or you are an NVIDIA Developer Program member.
- Complete the common prerequisites. Ensure that you configure the host with the NVIDIA Container Toolkit.
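  As a quick sanity check, you can confirm that Docker can reach the GPUs through the toolkit. The CUDA image tag below is illustrative; substitute any tag available to you:

  ```bash
  # Run nvidia-smi inside a throwaway CUDA container to verify GPU access from Docker.
  docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
  ```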
- A host with at least two NVIDIA A100, H100, or L40S GPUs.
  You need at least one GPU for the inference container and one GPU for the embedding container. By default, Milvus requires one GPU as well.
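  You can list the GPUs that the host exposes; this is a standard `nvidia-smi` query, not specific to these examples:

  ```bash
  # Show the index, model, and memory of each visible GPU.
  nvidia-smi --query-gpu=index,name,memory.total --format=csv
  ```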
- You have an NGC API key. Refer to Generating NGC API Keys in the NVIDIA NGC User Guide for more information.
- Export NGC-related environment variables:

  ```bash
  export NGC_API_KEY="M2..."
  ```
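  To pull NIM containers from nvcr.io, you also need to authenticate Docker against the registry. A minimal sketch, using the literal `$oauthtoken` username that NGC documents for API-key logins:

  ```bash
  # The username is the literal string $oauthtoken; the password is your NGC API key.
  echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
  ```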
- Create a directory to cache the models and export the path to the cache as an environment variable:

  ```bash
  mkdir -p ~/.cache/model-cache
  export MODEL_DIRECTORY=~/.cache/model-cache
  ```
- Export the connection information for the inference and embedding services:

  ```bash
  export APP_LLM_SERVERURL="nemollm-inference:8000"
  export APP_EMBEDDINGS_SERVERURL="nemollm-embedding:8000"
  ```
- Start the example-specific containers. Replace the path in the following `cd` command with the path to the example that you want to run:

  ```bash
  cd RAG/examples/basic_rag/langchain
  USERID=$(id -u) docker compose --profile local-nim --profile milvus up -d --build
  ```

  Example Output

  ```
  ✔ Container milvus-minio                           Running
  ✔ Container chain-server                           Running
  ✔ Container nemo-retriever-embedding-microservice  Started
  ✔ Container milvus-etcd                            Running
  ✔ Container nemollm-inference-microservice         Started
  ✔ Container rag-playground                         Started
  ✔ Container milvus-standalone                      Started
  ```
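  The NIM containers can take several minutes to download and load their models, so "Started" does not mean "ready". One way to wait is to poll the NIM readiness endpoint; host port 8000 is an assumption here, so check `docker ps` for the actual port mapping:

  ```bash
  # Poll the readiness endpoint until the inference model is loaded.
  until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
    echo "Waiting for the inference service..."
    sleep 10
  done
  echo "Inference service is ready."
  ```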
- Optional: Deploy the reranking service if your example requires it. Currently, only the Multi-Turn RAG example requires it.

  ```bash
  export APP_RANKING_SERVERURL="ranking-ms:8000"
  cd RAG/examples/local_deploy
  USERID=$(id -u) docker compose -f docker-compose-nim-ms.yaml up -d ranking-ms
  ```
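  You can confirm that the reranking container came up; the container name matches the service started above:

  ```bash
  # Show the status of the reranking container.
  docker ps --filter name=ranking-ms --format '{{.Names}}: {{.Status}}'
  ```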
- Open a web browser and access http://localhost:8090 to use the RAG Playground.
  Refer to Using the Sample Web Application for information about uploading documents and using the web interface.
When you start the microservices in the local_deploy directory, you can specify the GPUs to use by setting the following environment variables before you run `docker compose up`:

- `INFERENCE_GPU_COUNT`: Specify the number of GPUs to use with the NVIDIA NIM for LLMs container.
- `EMBEDDING_MS_GPU_ID`: Specify the GPU IDs to use with the NVIDIA NeMo Retriever Text Embedding NIM container.
- `RANKING_MS_GPU_ID`: Specify the GPU IDs to use with the NVIDIA NeMo Retriever Text Reranking NIM container.
- `VECTORSTORE_GPU_DEVICE_ID`: Specify the GPU IDs to use with Milvus.
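For example, to spread the services across the devices of a four-GPU host; the specific assignments below are illustrative only:

```bash
# Give the inference container one GPU and pin the other services
# to individual devices; adjust the IDs to match your host.
export INFERENCE_GPU_COUNT=1
export EMBEDDING_MS_GPU_ID=1
export RANKING_MS_GPU_ID=2
export VECTORSTORE_GPU_DEVICE_ID=3
```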
This documentation frequently demonstrates using the `curl` command to interact with the microservices. You can determine the IP address for each container by running:

```bash
docker network inspect nvidia-rag | jq '.[].Containers[] | {Name, IPv4Address}'
```
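With a container's address in hand, you can call the OpenAI-compatible chat completions endpoint that the NIM for LLMs container serves on port 8000. The IP address and model name below are placeholders; substitute the values from `docker network inspect` and your deployed model:

```bash
# Placeholder IP and model name; replace with your own values.
curl -s http://172.20.0.5:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "What is retrieval-augmented generation?"}],
        "max_tokens": 64
      }'
```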
- Vector Database Customizations
- Stop the containers by running `docker compose --profile local-nim down`.