I would like to express appreciation for the platform and benchmark you’ve created. It really helps to compare different models and methods.
However, I've noticed that evaluation and comparison on the leaderboard can be somewhat noisy, and I've observed a few methods using certain "tricks" that inflate their scores in ways that aren't ideal. Two things stand out:
Hyperparameter optimization on the test dataset
Since the test datasets are accessible, several methods optimize their hyperparameters directly on the test dataset, rather than following the standard practice of tuning only on the validation dataset.
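For illustration, here is a minimal sketch of the protocol I would expect, where the test split is only touched once after hyperparameters are fixed. The dataset, model, and hyperparameter grid below are synthetic placeholders, not taken from the benchmark or from any submission:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small benchmark dataset (the real splits are fixed by the benchmark).
X, y = make_regression(n_samples=900, n_features=20, noise=0.5, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Illustrative hyperparameter grid; not taken from any leaderboard method.
candidate_depths = [4, 8, 16, None]

best_depth, best_val_score = None, float("-inf")
for depth in candidate_depths:
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)            # fit on the training split only
    val_score = model.score(X_val, y_val)  # select hyperparameters on the validation split
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# The test split is used exactly once, after the hyperparameters are fixed.
final_model = RandomForestRegressor(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
print("test R^2:", final_model.score(X_test, y_test))
```

The practice I'm flagging would instead compare candidates by their test score, which biases the reported numbers upward.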
Using both training and validation data for training
Some methods train on both the training and validation datasets in every run. This removes variation in the training data (it is identical across all runs), and because the datasets typically contain fewer than 1,000 samples, folding in the validation data can considerably improve performance.
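To make the contrast concrete, here is a minimal sketch, under the assumption that run-to-run variation is supposed to come from re-splitting the non-test data per seed (the data, model, and split ratio are illustrative placeholders; actual split handling depends on the benchmark):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small (<1000 sample) development set (train + validation, no test).
X_dev, y_dev = make_regression(n_samples=800, n_features=20, noise=0.5, random_state=0)

# Intended protocol: each run gets its own train/validation split, so run-to-run
# variance reflects the data split as well as the model seed, and the validation
# rows are never part of the training set.
scores = []
for seed in range(5):
    X_train, X_val, y_train, y_val = train_test_split(
        X_dev, y_dev, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(random_state=seed)
    model.fit(X_train, y_train)  # validation rows are held out of training
    scores.append(model.score(X_val, y_val))

print("mean val R^2:", np.mean(scores), "+/-", np.std(scores))

# The practice flagged above would instead call model.fit(X_dev, y_dev) in every
# run: the model trains on ~25% more data and the split variation disappears.
```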
We investigated the methods on the leaderboard and summarized our findings in Table S2.5 of our recent paper.
I'm not sure whether it would be appropriate to set explicit rules about these practices (they are standard practice in machine learning, yet many methods ignore them).
@feiyang-cai thank you very much for pointing this out. We'll see what we can do. I'll have a look at your paper to see the cases where this happened. We can send a warning and, if needed, override the leaderboards, including the MPC task that spans these datasets.