Evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard #53
Comments
Hi @renmengjie7, would you mind sharing the reproduced results (e.g., full/hard, complete/instruct)? IMO the main difference would be: (1) batch inference (could be ~5% on the hard subset), which has been mentioned in the README; (2)
Cheers
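For reference, a minimal sketch (not the leaderboard's actual script) of generating completions one prompt at a time with greedy decoding, which avoids the small score drift that batched generation (padding, prompt grouping) can introduce. The model id comes from this thread; everything else is an assumption.

```python
# Sketch: batch-size-1 greedy generation so results do not depend on how prompts are batched.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-15b-instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

def generate_one(prompt: str, max_new_tokens: int = 1024) -> str:
    # Batch size 1: no padding tokens, so outputs are independent of prompt grouping.
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```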
This is a JSONL file.
Would you mind sharing the inference results, sanitized results, and execution results of StarCoder2-15B-Instruct-v0.1 used in the leaderboard? I want to know which step is unaligned with yours.
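If both sets of results become available, one way to pin down which step diverges is to diff the per-task execution status between the two runs. The file names and the `task_id`/`status` fields below are assumptions about the result format, not something stated in this thread.

```python
# Rough per-task diff between two execution-result JSONL files (assumed schema).
import json

def load_results(path):
    with open(path) as f:
        return {r["task_id"]: r["status"] for r in map(json.loads, f) if r.get("task_id")}

mine = load_results("my_eval_results.jsonl")            # hypothetical file name
theirs = load_results("leaderboard_eval_results.jsonl")  # hypothetical file name

for task_id in sorted(set(mine) | set(theirs)):
    if mine.get(task_id) != theirs.get(task_id):
        print(task_id, "mine:", mine.get(task_id), "leaderboard:", theirs.get(task_id))
```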
Thanks, so what's the final score you got after the evaluation? For the calibrated outputs, please refer to this section. Please note that it was done a few months ago, without batch inference.
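My reading of the calibration step, purely as a sketch under assumed names: the code prefix shown in the prompt is prepended to the model's raw response before sanitization and execution, unless the model already repeated it.

```python
# Hedged sketch of "calibrated" outputs as described above; calibrate() and its
# arguments are hypothetical names, not the package's API.
def calibrate(code_prefix: str, raw_response: str) -> str:
    # If the response already starts with (roughly) the prefix, keep it as-is;
    # otherwise prepend the prefix so the snippet is self-contained.
    if raw_response.lstrip().startswith(code_prefix.strip()[:40]):
        return raw_response
    return code_prefix.rstrip() + "\n" + raw_response.lstrip()
```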
BTW, based on the shared output format, it seems that you were not using the latest bigcodebench.
Thanks! That's the BigCodeBench dataset version, not the PyPI version. Recent models were evaluated on the v0.1.2 dataset. You can change the dataset version here.
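To avoid mixing the two up, you can print the installed PyPI package version directly (assuming the package is installed under the name `bigcodebench`):

```python
# Check the installed PyPI package version, independent of the dataset version
# the prompts were built from.
from importlib.metadata import version

print("bigcodebench (PyPI):", version("bigcodebench"))
```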
So, for StarCoder2-15B-Instruct-v0.1, the leaderboard used dataset v0.1.2? And is bigcodebench PyPI v0.2.0 compatible with the leaderboard when the inference batch size is set to 1?
It was evaluated on the v0.1.0 dataset. Ideally, the scores should still be close when using a batch size of 1. I mainly want to know the scores you got on the specific subset and split, to understand how significant the difference is.
I will try it again on the v0.1.0 dataset. Thank you!
What is the difference between dataset v0.1.0 and v0.1.2?
We refined several task descriptions and test cases, which is documented under this folder.
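If it helps to see the changes concretely, here is a hedged sketch for comparing the two dataset versions task by task. The version-named splits and the field names (`task_id`, `complete_prompt`, `test`) are assumptions; check the dataset card for the exact names.

```python
# Compare two BigCodeBench dataset versions and count tasks whose prompt or tests changed.
from datasets import load_dataset

old = load_dataset("bigcode/bigcodebench", split="v0.1.0")  # assumed split name
new = load_dataset("bigcode/bigcodebench", split="v0.1.2")  # assumed split name

old_by_id = {ex["task_id"]: ex for ex in old}
changed = [
    ex["task_id"]
    for ex in new
    if ex["task_id"] in old_by_id
    and (ex["complete_prompt"] != old_by_id[ex["task_id"]]["complete_prompt"]
         or ex["test"] != old_by_id[ex["task_id"]]["test"])
]
print(len(changed), "tasks differ between v0.1.0 and v0.1.2")
```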
Please make sure that you use the default Gradio endpoint for evaluation. That HF Space environment should be very similar to the one we originally used.
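A minimal sketch of reaching the hosted evaluator from Python; the Space id below is an assumption, so use whatever the README lists as the default remote endpoint.

```python
# Connect to the hosted Gradio evaluator and list its exposed endpoints and parameters,
# so the execution environment matches the one behind the leaderboard.
from gradio_client import Client

client = Client("bigcode/bigcodebench-evaluator")  # assumed Space id
client.view_api()  # prints the available evaluation endpoint(s) and their parameters
```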
What is the evaluation setting for StarCoder2-15B-Instruct-v0.1 in the leaderboard?
I ran the evaluation but got different performance. Any suggestions or insights?
Will the repository release the inference results and execution results of the models in the leaderboard? This would help with reproduction. 🙏🙏