Evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard #53
Comments
Hi @renmengjie7, would you mind sharing the reproduced results (e.g., full/hard, complete/instruct)? IMO the main difference would be: (1) batch inference (could be ~5% on the hard subset), which has been mentioned in the README; (2)
Cheers
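For reference, a minimal sketch (not the leaderboard's actual script) of generating completions one prompt at a time with greedy decoding, which avoids the small score drift that batched generation (padding, prompt grouping) can introduce. The model id comes from this thread; everything else is an assumption.

```python
# Sketch: batch-size-1 greedy generation so results do not depend on how prompts are batched.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-15b-instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

def generate_one(prompt: str, max_new_tokens: int = 1024) -> str:
    # Batch size 1: no padding tokens, so outputs are independent of prompt grouping.
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```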
This is a JSONL file.
Would you mind sharing the inference results, sanitized results, and execution results of StarCoder2-15B-Instruct-v0.1 used in the leaderboard? I want to know which step is unaligned with yours.
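If both sets of results become available, one way to pin down which step diverges is to diff the per-task execution status between the two runs. The file names and the `task_id`/`status` fields below are assumptions about the result format, not something stated in this thread.

```python
# Rough per-task diff between two execution-result JSONL files (assumed schema).
import json

def load_results(path):
    with open(path) as f:
        return {r["task_id"]: r["status"] for r in map(json.loads, f) if r.get("task_id")}

mine = load_results("my_eval_results.jsonl")            # hypothetical file name
theirs = load_results("leaderboard_eval_results.jsonl")  # hypothetical file name

for task_id in sorted(set(mine) | set(theirs)):
    if mine.get(task_id) != theirs.get(task_id):
        print(task_id, "mine:", mine.get(task_id), "leaderboard:", theirs.get(task_id))
```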
Thanks, so what's the final score you got after the evaluation? For the calibrated outputs, please refer to this section. Please note that it was done a few months ago, without batch inference.
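My reading of the calibration step, purely as a sketch under assumed names: the code prefix shown in the prompt is prepended to the model's raw response before sanitization and execution, unless the model already repeated it.

```python
# Hedged sketch of "calibrated" outputs as described above; calibrate() and its
# arguments are hypothetical names, not the package's API.
def calibrate(code_prefix: str, raw_response: str) -> str:
    # If the response already starts with (roughly) the prefix, keep it as-is;
    # otherwise prepend the prefix so the snippet is self-contained.
    if raw_response.lstrip().startswith(code_prefix.strip()[:40]):
        return raw_response
    return code_prefix.rstrip() + "\n" + raw_response.lstrip()
```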
BTW, based on the shared output format, it seems that you were not using the latest bigcodebench.
Thanks! That's the BigCodeBench dataset version, not the PyPI version. Recent models were evaluated on the v0.1.2 dataset. You can change the dataset version here.
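To avoid mixing the two up, you can print the installed PyPI package version directly (assuming the package is installed under the name `bigcodebench`):

```python
# Check the installed PyPI package version, independent of the dataset version
# the prompts were built from.
from importlib.metadata import version

print("bigcodebench (PyPI):", version("bigcodebench"))
```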
So, for StarCoder2-15B-Instruct-v0.1, the leaderboard used dataset v0.1.2? And is bigcodebench PyPI v0.2.0 compatible with the leaderboard when the inference batch size is set to 1?
It was evaluated on the v0.1.0 dataset. Ideally, the scores should still be close when using a batch size of 1. I mainly want to know the scores you got on the specific subset and split, to understand how significant the difference is.
I will try it again on the v0.1.0 dataset. Thank you!
What is the difference between dataset v0.1.0 and v0.1.2?
We refined several task descriptions and test cases, which is documented under this folder.
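If it helps to see the changes concretely, here is a hedged sketch for comparing the two dataset versions task by task. The version-named splits and the field names (`task_id`, `complete_prompt`, `test`) are assumptions; check the dataset card for the exact names.

```python
# Compare two BigCodeBench dataset versions and count tasks whose prompt or tests changed.
from datasets import load_dataset

old = load_dataset("bigcode/bigcodebench", split="v0.1.0")  # assumed split name
new = load_dataset("bigcode/bigcodebench", split="v0.1.2")  # assumed split name

old_by_id = {ex["task_id"]: ex for ex in old}
changed = [
    ex["task_id"]
    for ex in new
    if ex["task_id"] in old_by_id
    and (ex["complete_prompt"] != old_by_id[ex["task_id"]]["complete_prompt"]
         or ex["test"] != old_by_id[ex["task_id"]]["test"])
]
print(len(changed), "tasks differ between v0.1.0 and v0.1.2")
```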
Please make sure that you use the default Gradio endpoint for evaluation. That HF Space environment should be very similar to the one we originally used.
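A minimal sketch of reaching the hosted evaluator from Python; the Space id below is an assumption, so use whatever the README lists as the default remote endpoint.

```python
# Connect to the hosted Gradio evaluator and list its exposed endpoints and parameters,
# so the execution environment matches the one behind the leaderboard.
from gradio_client import Client

client = Client("bigcode/bigcodebench-evaluator")  # assumed Space id
client.view_api()  # prints the available evaluation endpoint(s) and their parameters
```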
What is the evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard?
I ran the evaluation but got different performance. Any suggestions or insights?
Will the repository release the inference results and execution results of the models in the leaderboard? This would help with reproduction. 🙏🙏