Fork with more benchmarks and features: Merge some of them? #7
matthiasgeihs
started this conversation in
Show and tell
Replies: 1 comment
-
If canonical solutions are already in the training dataset, the results should be invalidated. That seems like a big problem for reproducibility :(
-
Hey @abacaj, I really like your repository. I always found it a bit puzzling how these benchmark results were produced, because pre- and post-processing do play a role and are often not clearly documented. Your repository solves that!
I am the maintainer of the fork torusresearch/code-eval. There we added some things:
validate.py, which analyzes this overlap; we also added a corresponding column in the table.
I am happy to contribute (a subset of) these features to this repository. Let me know.