- Time: October 30, 2024, 11:30 - 12:30
- Topic: LLM Inference Optimizations
- Location: TCS 1416
- Slack channel: `track-2` (use it to post questions, exact error messages, etc.)
- Get an interactive node:
```bash
qsub -I -l select=1:ngpus=4 -l filesystems=home:eagle:grand -l walltime=1:00:00 -q HandsOnHPC -A alcf_training
```
- Clone the repo and activate the environment:
```bash
git clone https://github.com/argonne-lcf/ALCF_Hands_on_HPC_Workshop.git
cd ALCF_Hands_on_HPC_Workshop/InferenceOptimizations
module use /soft/modulefiles
module load conda/2024-10-30-workshop
conda activate
```
We will use the Llama3-8B model for the hands-on inference examples.
- Inference with Hugging Face
```bash
bash run_HF.sh
```
This script runs `run_HF.py` with the correct command-line flags; a minimal sketch of the pattern follows.
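For orientation, here is a minimal sketch of baseline Hugging Face generation, assuming `run_HF.py` follows the standard `transformers` pattern; the model ID and generation settings below are illustrative, not the script's actual flags.

```python
# Minimal Hugging Face inference sketch (illustrative; not the exact run_HF.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory
    device_map="auto",           # place layers on the available GPU(s)
)

prompt = "The key to fast LLM inference is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```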
- Inference with vLLM
```bash
bash run_vllm.sh
```
This script runs `run_vllm.py` with the correct command-line flags; a sketch of the offline vLLM API follows.
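A minimal sketch of offline vLLM inference, assuming `run_vllm.py` uses the standard `LLM` entry point; the checkpoint and sampling settings are illustrative.

```python
# Minimal vLLM inference sketch (illustrative; not the exact run_vllm.py).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")  # assumed checkpoint name
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM batches prompts and manages the KV cache (PagedAttention) internally.
outputs = llm.generate(["The key to fast LLM inference is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```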
- vLLM Quantization Example
```bash
bash run_vllm_quant.sh
```
This script runs `run_vllm.py` with the correct command-line flags; see the quantization sketch below.
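A hedged sketch of loading a quantized model in vLLM. The quantization method (FP8 here) and the pre-quantized checkpoint name are assumptions; the actual script may use a different scheme such as AWQ or GPTQ.

```python
# vLLM quantized-inference sketch (illustrative; the real script may use a
# different quantization method and checkpoint).
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # assumed pre-quantized checkpoint
    quantization="fp8",  # tell vLLM which quantization scheme the weights use
)
sampling = SamplingParams(max_tokens=64)
print(llm.generate(["Quantization reduces memory traffic by"], sampling)[0].outputs[0].text)
```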
- vLLM Speculative Decoding (SD) Example
```bash
bash run_vllm_SD.sh
```
This script runs `run_vllm.py` with the correct command-line flags; a speculative-decoding sketch follows.
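A hedged sketch of speculative decoding with a small draft model, using the `speculative_model` arguments available in vLLM releases from around this workshop (the API changed in later versions); both model choices and the token count are assumptions.

```python
# vLLM speculative-decoding sketch (illustrative; flags in run_vllm.py may differ).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",           # target model (assumed)
    speculative_model="meta-llama/Llama-3.2-1B",  # small draft model (assumed)
    num_speculative_tokens=5,  # tokens the draft proposes before the target verifies them
)
sampling = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Speculative decoding accelerates generation by"], sampling)[0].outputs[0].text)
```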
Contributors: Krishna Teja Chitty-Venkata and Siddhisanket (Sid) Raskar.
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.