Cannot reproduce distillgpt2 LM Numbers using --knn #14

Open

HossamAmer12 opened this issue Oct 17, 2024 · 5 comments

Comments

@HossamAmer12 commented Oct 17, 2024

I am trying to build on your knn-transformers repo.

When I run distilgpt2 with the setup given in the repo, but with the --knn flag, I get a perplexity of around 21.xx. This number differs from the one reported in the repository.

MODEL=neulab/distilgpt2-finetuned-wikitext103
python -u run_clm.py \
  --model_name_or_path ${MODEL} \
  --dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 \
  --output_dir checkpoints/${MODEL}_knn \
  --do_eval --eval_subset validation \
  --dstore_dir /tmp/distillgpt2/ --dstore_size 116988150 \
  --knn

I am able to reproduce the other numbers (baseline + RetoMaton) for distilgpt2.

Could you please let me know if you have any idea what might be going on here?

@urialon (Collaborator) commented Oct 17, 2024 via email

@HossamAmer12 (Author)

Thanks @urialon for getting back.

The model I was using above (sorry, I edited my post) is the one given in the repo. Even so, the scores are different.

Based on your suggestion, I tried building the dstore myself. Every time, I hit this error:
UserScriptFilledDisk: User script filled the disk. Consider using Virtual Machine SKU with larger disk size.

This is the command I used to build the datastore:

MODEL=neulab/distilgpt2-finetuned-wikitext103
path_to=""

CUDA_VISIBLE_DEVICES=0 python -u run_clm.py \
  --model_name_or_path ${MODEL} \
  --dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 \
  --do_eval --eval_subset train \
  --output_dir $path_to/checkpoints/${MODEL} \
  --dstore_dir $path_to/checkpoints/${MODEL} \
  --save_knnlm_dstore --dstore_size 116988150

Does it really require that much disk space?

Question: do I have to specify the dstore size here? What does the dstore size indicate? The number of contexts?

Another question: when running kNN-LM with the given distilgpt2 model, should I use a specific temperature or lambda? I saw you post about this elsewhere so that the scores can be reproduced.
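
For context, here is a rough back-of-the-envelope estimate of the datastore footprint. It assumes keys are stored as fp16 memmaps of distilgpt2's 768-dimensional hidden states and values as 32-bit token ids, which may not match the repo's exact layout:

dstore_size = 116_988_150                  # one (key, value) pair per training token
hidden_dim = 768                           # distilgpt2 hidden size (assumption)
key_bytes = dstore_size * hidden_dim * 2   # fp16 keys
val_bytes = dstore_size * 4                # int32 token ids
print(f"keys ~ {key_bytes / 1e9:.0f} GB, values ~ {val_bytes / 1e9:.1f} GB")
# keys ~ 180 GB, values ~ 0.5 GB, before any FAISS index is built

Under that assumption, the UserScriptFilledDisk error is consistent with simply running out of local disk while writing the keys.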

@HossamAmer12 (Author) commented Oct 18, 2024

Just wanted to give an update on the issue: using the following did not run into the disk-size problem:

MODEL=neulab/distilgpt2-finetuned-wikitext103
CUDA_VISIBLE_DEVICES=0 python -u run_clm.py \
  --model_name_or_path ${MODEL} \
  --dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 \
  --do_eval --eval_subset validation \
  --output_dir ${path}/checkpoints/${MODEL}_SAVE0 \
  --dstore_dir ${path}/checkpoints/${MODEL}_SAVE0 \
  --save_knnlm_dstore --dstore_size 116988150

I guess that's because of the small size of the validation split (I know that's not a realistic setup). Do you know how large the training-set datastore is and what the disk requirements are?
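
As a rough check of how many entries a validation-only datastore needs, the sketch below counts BPE tokens in the split. It is only an approximation; run_clm.py's exact tokenization and grouping may differ slightly:

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("neulab/distilgpt2-finetuned-wikitext103")
val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
# One datastore entry per token, so the BPE token count of the split is
# roughly the --dstore_size needed for it.
n_tokens = sum(len(tok(t)["input_ids"]) for t in val["text"] if t.strip())
print(n_tokens)

This comes out far smaller than the ~117M entries of the training split, hence the much smaller disk footprint.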

@HossamAmer12 (Author)

Hi Uri,

I constructed the datastore from the wikitext validation set with the given distilgpt2 model, and then ran kNN evaluation on that same set. The final perplexity is not good relative to the baseline.

What could be the problem?

Even though the setup is not practical, I expected the perplexity to be much better, given that the datastore and the evaluation set are identical.

That is on top of not being able to use the training set for the kNN datastore because of the disk-space problem, which I have not yet figured out.
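
For reference, the mixture I have in mind is the standard kNN-LM interpolation (a minimal sketch only; the repo's defaults for lambda and the neighbor-distance temperature may differ, and 0.25 is just the value commonly reported for WikiText-103):

import torch

def knn_lm_log_probs(lm_log_probs, knn_log_probs, lmbda=0.25):
    # Standard kNN-LM mixture, computed in log space for stability:
    # p(w|x) = lmbda * p_knn(w|x) + (1 - lmbda) * p_lm(w|x)
    lmbda = torch.tensor(lmbda)
    return torch.logsumexp(
        torch.stack([knn_log_probs + torch.log(lmbda),
                     lm_log_probs + torch.log1p(-lmbda)]),
        dim=0,
    )

If the datastore is built from the evaluation set itself, p_knn should be very sharp around the true next token, so I would expect perplexity to drop noticeably unless lambda or the temperature is off.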

I would appreciate your advice.

Thanks,
Hossam

@HossamAmer12 changed the title from "Cannot reproduce distillgpt2 LM Numbers" to "Cannot reproduce distillgpt2 LM Numbers using --knn" on Oct 21, 2024
@urialon (Collaborator) commented Oct 22, 2024 via email
