The NeurIPS 2023 LLM Efficiency Challenge is a competition focused on training one LLM for 24 hours on a single GPU; the team with the best LLM gets to present its results at NeurIPS 2023.
This quick start guide outlines the main steps to get started with Lit-GPT, which was selected as the competition's official starter kit.
Permitted GPUs:
- 1x A100 (40 GB GPU memory);
- 1x RTX 4090 (24 GB GPU memory).
Permitted models:
- All transformer-based LLM base models that are not finetuned yet.
The subset of Lit-GPT models supported in this competition is listed in the table below. These don't include models that have been finetuned or otherwise aligned, as per the rules of the challenge.
Models in Lit-GPT | Reference |
---|---|
Meta AI Llama 2 Base | Touvron et al. 2023 |
TII UAE Falcon Base | TII 2023 |
OpenLM Research OpenLLaMA | Geng & Liu 2023 |
EleutherAI Pythia | Biderman et al. 2023 |
StabilityAI StableLM Base | Stability AI 2023 |
Permitted datasets:
Any open-source dataset is allowed. Originally, per the competition rules, datasets that utilize "generated content" from other LLMs were not permitted. However, the rules were recently softened to also allow LLM-generated datasets, provided those datasets are made publicly available and their use does not violate the usage restrictions and guidelines of the LLM that generated them. If you plan to use a specific dataset that is not explicitly listed on the challenge website, or if you want to use LLM-generated data, it is recommended to reach out to the organizers and confirm that this is in line with the competition rules.
Examples of permitted datasets are listed on the challenge website; for instance, the Databricks Dolly 15k instruction dataset used later in this guide falls into this category.
You are allowed to create your own datasets if they are made publicly accessible under an open-source license, and they are not generated from other LLMs (even open-source ones).
Helpful competition rules relevant to the dataset choice:
- The maximum prompt/completion length the models are expected to handle is 2048 tokens.
- The evaluation will be on English texts only.
Submission deadline
- October 25, 2023 (Please check the official website in case of updates.)
Use the following steps to set up the Lit-GPT repository on your machine.
git clone https://github.com/Lightning-AI/lit-gpt
cd lit-gpt
pip install -r requirements.txt tokenizers sentencepiece huggingface_hub
This section explains how to download the StableLM 3B Base model, one of the smallest models supported in Lit-GPT (an even smaller supported model is Pythia, which starts at 70M parameters). The downloaded and converted checkpoints will occupy approximately 28 GB of disk space.
python scripts/download.py \
--repo_id stabilityai/stablelm-base-alpha-3b
python scripts/convert_hf_checkpoint.py \
--checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
While StableLM 3B Base is useful as a first starter model to set things up, you may want to use the more capable Falcon 7B or Llama 2 7B/13B models later. See the download_* tutorials in Lit-GPT for instructions on downloading other model checkpoints.
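For example, downloading and converting Falcon 7B follows the same two-step pattern as above; the tiiuae/falcon-7b repo id below is the model's Hugging Face Hub identifier and is assumed to be among the checkpoints supported by your Lit-GPT version:
python scripts/download.py \
--repo_id tiiuae/falcon-7b
python scripts/convert_hf_checkpoint.py \
--checkpoint_dir checkpoints/tiiuae/falcon-7b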
After downloading and converting the model checkpoint, you can test the model via the following command:
python generate/base.py \
--prompt "LLM efficiency competitions are fun, because" \
--checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
The following command will download and preprocess the Dolly 15k dataset for the StableLM 3B Base model:
python scripts/prepare_dolly.py \
--checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
--destination_path data/dolly-stablelm3b
Note
The preprocessed dataset is specific to the StableLM 3B Base model. If you later use a different model such as Falcon or Llama 2, you'll need to rerun the preprocessing with that model's checkpoint directory, because each model uses a different tokenizer.
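For instance, a minimal sketch of re-preparing the dataset for Falcon 7B (the data/dolly-falcon7b destination path is just an illustrative name):
python scripts/prepare_dolly.py \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--destination_path data/dolly-falcon7b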
Low-rank Adaptation (LoRA) is a good choice for a first finetuning run. The Dolly dataset has ~15k samples, and the finetuning might take half an hour.
To accelerate this for testing purposes, edit the ./finetune/lora.py script and change max_iters = 50000 to max_iters = 500 at the top of the file.
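If you prefer making this change from the command line, a quick in-place edit with GNU sed could look as follows (assuming the assignment in lora.py is written exactly as max_iters = 50000):
sed -i 's/max_iters = 50000/max_iters = 500/' finetune/lora.py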
Note
The Dolly dataset has a relatively long context length, which could result in out-of-memory issues. The maximum context length used for the evaluation, according to the official competition rules, is 2,048 tokens. Hence, it's highly recommended to prepare the dataset with a fixed maximum length, for example, python scripts/prepare_dolly.py --max_seq_length 2048.
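Putting this together with the paths used earlier, the full preparation command would look something like the following (assuming --max_seq_length can simply be combined with the other flags):
python scripts/prepare_dolly.py \
--checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
--destination_path data/dolly-stablelm3b \
--max_seq_length 2048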
The following command finetunes the model:
CUDA_VISIBLE_DEVICES=2 python finetune/lora.py \
--data_dir data/dolly-stablelm3b \
--checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
--out_dir "out/stablelm3b/dolly/lora/experiment1" \
--precision "bf16-true"
With 500 iterations, this takes approximately 1-2 min on an A100 and uses 26.30 GB of GPU memory. If you are using an RTX 4090, change micro_batch_size=4 to micro_batch_size=1 in finetune/lora.py so that the model only uses 12.01 GB of memory.
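As before, this can be done with a one-line in-place edit (the sed pattern assumes the file writes the assignment as micro_batch_size = 4; adjust it to match the actual formatting in lora.py):
sed -i 's/micro_batch_size = 4/micro_batch_size = 1/' finetune/lora.py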
(More finetuning settings are explained here.)
The official competition will use a small subset of HELM tasks for model evaluation, which includes BigBench (general), MMLU (knowledge), TruthfulQA (knowledge and harm in a multiple-choice format), CNN/DailyMail (news summarization), GSM8K (math), and BBQ (bias).
HELM is currently also being integrated into Lit-GPT to evaluate LLMs before submission.
However, a tool with a more convenient interface is EleutherAI's Evaluation Harness, which contains several tasks that overlap with HELM, for example, BigBench, TruthfulQA, and GSM8K. We can set up the Evaluation Harness as follows:
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@master
We can then use it via the following command:
python eval/lm_eval_harness.py \
--checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b" \
--precision "bf16-true" \
--eval_tasks "[truthfulqa_mc,gsm8k]" \
--batch_size 4 \
--save_filepath "results-stablelm-3b.json"
(You can find a full task list in the task table here.)
To evaluate a LoRA-finetuned model, you need to first merge the LoRA weights with the base model to create a new checkpoint file:
python scripts/merge_lora.py \
--checkpoint_dir "checkpoints/stabilityai/stablelm-base-alpha-3b/" \
--lora_path "out/stablelm3b/dolly/lora/experiment1/lit_model_lora_finetuned.pth" \
--out_dir "out/lora_merged/stablelm-base-alpha-3b/"
cp checkpoints/stabilityai/stablelm-base-alpha-3b/*.json \
out/lora_merged/stablelm-base-alpha-3b/
For more information on LoRA weight merging, please see the Merging LoRA Weights section of the LoRA finetuning documentation.
After merging the weights, we can use the lm_eval_harness.py script as before; the only difference is that we now point it to the checkpoint folder containing the merged LoRA model (you may also want to pass a different --save_filepath so that the earlier base-model results are not overwritten):
python eval/lm_eval_harness.py \
--checkpoint_dir "out/lora_merged/stablelm-base-alpha-3b" \
--precision "bf16-true" \
--eval_tasks "[truthfulqa_mc,gsm8k]" \
--batch_size 4 \
--save_filepath "results-stablelm-3b.json"
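To quickly skim the saved metrics, you can pretty-print the resulting JSON file with Python's built-in json.tool module (use whichever filename you passed to --save_filepath):
python -m json.tool results-stablelm-3b.json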
You will be required to submit a Docker image for the submission itself. Fortunately, the organizers have a GitHub repository with the exact steps here and a toy-submission setup guide to test your model locally before submission.
- The official NeurIPS 2023 LLM Efficiency Challenge competition website
- A more extensive guide, including environment setup tips: The NeurIPS 2023 LLM Efficiency Challenge Starter Guide
- Official competition Discord and Lightning AI + Lit-GPT Discord
- LoRA vs Adapter vs Adapter v2 comparison in Lit-GPT using Falcon 7B: Finetuning Falcon LLMs More Efficiently With LoRA and Adapters
- Dealing with out-of-memory (OOM) errors in Lit-GPT
- Introduction to Fabric (an API to access more advanced PyTorch features used in Lit-GPT) and memory saving tips: Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch