
Evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard #53

Open · renmengjie7 opened this issue Oct 23, 2024 · 12 comments
Labels: question (Further information is requested)

@renmengjie7 commented Oct 23, 2024

What is the evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard? I ran the evaluation but got different performance. Any suggestions or insights?

Will the repository release the inference results and execution results of the models in the leaderboard? This would help with reproduction. 🙏🙏

@terryyz (Collaborator) commented Oct 23, 2024

Hi @renmengjie7,

Would you mind sharing the reproduced results (e.g., full/hard, complete/instruct)?

IMO the main differences would be: (1) batch inference (it can shift scores by ~5% on the hard subset), which is mentioned in the README; (2) --strip_newlines, as mentioned in ADVANCED_USAGE (I forget whether I used it for the self-instruct StarCoder2); and (3) some updated task descriptions.

Cheers

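For anyone reproducing this, the batch-inference point above can be sidestepped by decoding one prompt at a time. The sketch below is not the bigcodebench pipeline, just a minimal illustration of single-sample greedy decoding with transformers; the model id matches the model under discussion, while the prompt handling and generation length are assumptions.

```python
# Minimal sketch (not the bigcodebench pipeline): decode one prompt at a time with
# sampling disabled (greedy), so padding/batching cannot influence the outputs.
# The prompt format and max_new_tokens are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-15b-instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_one(prompt: str, max_new_tokens: int = 1024) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```
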
@renmengjie7 (Author) commented Oct 24, 2024

Attached: inference.json (this is a JSONL file)

@renmengjie7 (Author) commented Oct 24, 2024

Would you mind sharing the inference results, sanitized results, and execution results of StarCoder2-15B-Instruct-v0.1 from the leaderboard?

I want to know which step is misaligned with yours.

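One way to localize where two runs diverge is to diff the generation files task by task. The sketch below assumes both files are JSONL with task_id and solution fields; those field names and the file names are assumptions, not a documented bigcodebench schema.

```python
# Rough sketch: diff two JSONL generation files by task_id to find tasks whose
# generations differ between runs. The "task_id"/"solution" field names are an
# assumption about the file layout, not a documented bigcodebench schema.
import json

def load_jsonl(path: str) -> dict[str, str]:
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                records[row["task_id"]] = row.get("solution", "")
    return records

mine = load_jsonl("my_inference.jsonl")                  # hypothetical file names
reference = load_jsonl("leaderboard_inference.jsonl")

shared = mine.keys() & reference.keys()
diverging = sorted(tid for tid in shared if mine[tid] != reference[tid])
print(f"{len(diverging)} of {len(shared)} shared tasks differ; first few: {diverging[:5]}")
```
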
@terryyz (Collaborator) commented Oct 24, 2024

Thanks, so what's the final score you got after the evaluation?

For the calibrated outputs, please refer to this section. Please note that the evaluation was done a few months ago, without batch inference.

@terryyz (Collaborator) commented Oct 24, 2024

BTW, based on the shared output format, it seems that you were not using the latest bigcodebench version?

@renmengjie7 (Author)

Yes, I used version 2 before. But I noticed that the notes on the leaderboard said it used version 1, so I changed it.

(Screenshot: 2024-10-24, 4:36 PM)

@terryyz (Collaborator) commented Oct 24, 2024

Thanks! That's the BigCodeBench dataset version, not the PyPI version. Recent models were evaluated on the v0.1.2 dataset. You can change the dataset version here.

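For inspecting a specific dataset release outside the CLI, something along the following lines may work. The Hub id bigcode/bigcodebench is taken from the project, but exposing releases as git revisions of the dataset repo is an assumption to verify.

```python
# Sketch only: load a pinned BigCodeBench dataset release for inspection.
# Assumes version tags such as "v0.1.0" exist as git revisions of the dataset
# repo on the Hugging Face Hub -- check the repo before relying on this.
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench", revision="v0.1.0")
print({name: len(split) for name, split in ds.items()})
```
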
terryyz self-assigned this on Oct 24, 2024
terryyz added the question (Further information is requested) label on Oct 24, 2024
@renmengjie7 (Author)

So, for StarCoder2-15B-Instruct-v0.1, the leaderboard used dataset v0.1.2?

And is bigcodebench PyPI v0.2.0 compatible with the leaderboard when inference_batch is set to 1?

@terryyz (Collaborator) commented Oct 24, 2024

> So, for StarCoder2-15B-Instruct-v0.1, the leaderboard used dataset v0.1.2?

It was evaluated on the v0.1.0 dataset.

> And is bigcodebench PyPI v0.2.0 compatible with the leaderboard when inference_batch is set to 1?

Ideally, the scores should still be close when using a batch size of 1. I mainly want to know the scores you got on the specific subset and split, to understand how significant the difference is.

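As an aside, the installed PyPI release can be confirmed independently of bigcodebench's own API via the standard library; the snippet below only reads package metadata.

```python
# Read the installed bigcodebench distribution version from package metadata;
# importlib.metadata works for any installed package.
from importlib.metadata import version

print(version("bigcodebench"))
```
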
@renmengjie7 (Author)

I will try it again on the v0.1.0 dataset.

Thank you!

@renmengjie7 (Author)

What is the difference between dataset v0.1.0 and v0.1.2?

@terryyz (Collaborator) commented Oct 24, 2024

We refined several task descriptions and test cases, which is documented under this folder.

> I will try it again on the v0.1.0 dataset.

Please make sure that you use the default Gradio endpoint for evaluation. That HF Space environment should be very similar to the one we originally used.

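If you prefer scripting against a Gradio endpoint rather than using it interactively, gradio_client is the usual route. The Space id and endpoint arguments below are placeholders, not the documented BigCodeBench evaluator API.

```python
# Hypothetical sketch: call a Gradio-hosted evaluator Space programmatically.
# The Space id and predict() arguments are placeholders -- check the Space's
# "Use via API" page for the real endpoint name and signature.
from gradio_client import Client

client = Client("bigcode/bigcodebench-evaluator")               # placeholder Space id
result = client.predict("samples.jsonl", api_name="/evaluate")  # placeholder endpoint
print(result)
```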