Adding new subtask to SCORE tasks: non greedy robustness #2558

Open
wants to merge 41 commits into
base: main
88d3a28
score readme added
rimashahbazyan Oct 22, 2024
a4fcc64
generate until task's "until" parameter's default value fixed.
rimashahbazyan Oct 22, 2024
cc3084f
score mmlu-pro and agieval added
rimashahbazyan Oct 22, 2024
6fa3cf9
changed macro accuracy to micro for agieval
rimashahbazyan Oct 29, 2024
e06db0c
Always E removed from agi eval
rimashahbazyan Oct 30, 2024
b11c053
redundancies removed
rimashahbazyan Oct 30, 2024
211fd13
MATH added
rimashahbazyan Oct 30, 2024
f6730d9
minor cosmetic changes for math
rimashahbazyan Oct 30, 2024
5981dc4
Licenses added Readme updated
rimashahbazyan Nov 1, 2024
998c54b
changes for flake8 + license header on math
rimashahbazyan Nov 5, 2024
0258930
Score added to readme and precommit was run.
rimashahbazyan Nov 6, 2024
04ddccc
Score added to readme and precommit was run.
rimashahbazyan Nov 6, 2024
f43a422
Merge branch 'score_tasks' of github.com:rimashahbazyan/lm-evaluation…
rimashahbazyan Nov 6, 2024
8ad246d
Import error fixed
rimashahbazyan Nov 6, 2024
c15aa8a
math task bugfix
rimashahbazyan Nov 13, 2024
fab6fed
CR for math added
rimashahbazyan Nov 14, 2024
0387d60
math CR
rimashahbazyan Nov 14, 2024
37f50c7
math task bugfix
rimashahbazyan Nov 13, 2024
91fa830
Merge branch 'score_tasks' of github.com:rimashahbazyan/lm-evaluation…
rimashahbazyan Nov 14, 2024
7177e96
Math cr fixed
rimashahbazyan Nov 15, 2024
b29b637
mmlu_pro non_greedy task added
rimashahbazyan Nov 25, 2024
82c1cd9
non greedy summarizer added
rimashahbazyan Nov 26, 2024
5937888
Non greedy for all score tasks
rimashahbazyan Dec 2, 2024
d852029
Bugfixes for non-greedy
rimashahbazyan Dec 3, 2024
56d87af
Merge branch 'main' of github.com:rimashahbazyan/lm-evaluation-harnes…
rimashahbazyan Dec 3, 2024
2eebecc
fixing the until argument
rimashahbazyan Dec 3, 2024
7bbb8de
undoing the change to "until" arguments default behaviour
rimashahbazyan Dec 3, 2024
3181b1f
minor fix in summarizer
rimashahbazyan Dec 3, 2024
0d204e5
log naming changes for better readability
rimashahbazyan Dec 3, 2024
ac87ec5
math subtasks naming fix
rimashahbazyan Dec 3, 2024
69bab66
agieval subtask naming fix
rimashahbazyan Dec 3, 2024
9d2aa3e
logging added for debugging
rimashahbazyan Dec 3, 2024
c6cf775
path issue fixed
rimashahbazyan Dec 3, 2024
9c41c9d
minor fix
rimashahbazyan Dec 3, 2024
7ce10a7
path fix
rimashahbazyan Dec 3, 2024
1fc37c3
path fix
rimashahbazyan Dec 3, 2024
e73371d
non_greedy_math minor fix
rimashahbazyan Dec 3, 2024
cfb8747
final changes
rimashahbazyan Dec 3, 2024
eb90986
changed readme for non-greedy
rimashahbazyan Dec 4, 2024
35d62af
non greedy summarizer bugfix
rimashahbazyan Dec 6, 2024
5200ea6
non_greedy summarizer fixed
rimashahbazyan Dec 9, 2024
45 changes: 45 additions & 0 deletions lm_eval/tasks/score/NON_GREEDY.md
@@ -0,0 +1,45 @@
```
Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
# Non Greedy Evaluation

This task measures how consistent a model's answers are when the generation seed changes.
Specifically, it evaluates the model's accuracy and consistency rate over 5
different seeds (seed = 1, 2, ..., 5) for a fixed prompt, with the temperature set to 0.7.
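
As a rough illustration of what a consistency rate over seeds could look like, the sketch below computes the mean pairwise agreement of the answers given to each question. This is an assumption about the metric's shape, not the summarizer's actual code; `consistency_rate` and the `predictions_per_question` structure are hypothetical names.

```python
from itertools import combinations


def consistency_rate(predictions_per_question):
    """Mean pairwise agreement of answers across seeds, averaged over questions.

    `predictions_per_question` maps a question id to the list of final
    answers produced for it, one per seed (hypothetical structure).
    """
    per_question = []
    for answers in predictions_per_question.values():
        pairs = list(combinations(answers, 2))
        if not pairs:  # fewer than two seeds: agreement is undefined, skip
            continue
        agreeing = sum(a == b for a, b in pairs)
        per_question.append(agreeing / len(pairs))
    return sum(per_question) / len(per_question) if per_question else 0.0


example = {
    "q1": ["A", "A", "A", "A", "A"],  # all 10 pairs agree
    "q2": ["A", "A", "A", "B", "B"],  # 3 + 1 = 4 of 10 pairs agree
}
print(consistency_rate(example))
```

A model can score the same accuracy on every seed and still have a low consistency rate if it gets different questions right each time, which is why the two metrics are reported together.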

## How to run the Non-Greedy evaluation of SCORE?

Evaluating the non-greedy tasks differs slightly from the other SCORE tasks because each seed must be passed manually as an argument. The step-by-step guide below describes how to run the **Score Non-Greedy** evaluation correctly.

To run the evaluation of the Non-Greedy tasks with 5 different seeds:
1. For a given dataset, run the evaluation by
   * specifying the task as `score_non_greedy_robustness_{DATASET_NAME}` (`DATASET_NAME` being either `agieval`, `mmlu_pro` or `math`)
   * fixing the seed with the run argument `--seed=1`
   * passing the `--log_samples` argument\*
   * specifying an output path with `--output_path=SOME_OUTPUT_PATH/seed_1`
   * if running with vllm, also setting the seed in `--model_args` by specifying the `seed` parameter

2. Repeat the process for each of the 5 seeds\*\*, changing the `--seed` and `--output_path` arguments accordingly from 1 to 5.

3. When all 5 runs have finished and the logs are saved, run the `./lm_eval/tasks/score/non_greedy_summarizer.py` script, passing the output directory of the above runs to the `--log_dir` argument\*\*\* and the name of the evaluated dataset to the `--dataset` argument (`agieval`, `mmlu_pro` or `math`).

4. The script returns the default lm_evaluation_harness table, which reports the accuracy for each seed and the consistency rate.
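
The steps above can be sketched as a small helper that only builds the command lines, one per seed, plus the final summarizer call. It is a minimal sketch, not the contents of `non_greedy.sh`: the helper names, the `hf` model type and the `pretrained=MODEL_NAME` placeholder are assumptions, while the flags are the ones named in this guide.

```python
import shlex


def build_seed_commands(dataset, output_root, model="hf",
                        model_args="pretrained=MODEL_NAME", seeds=range(1, 6)):
    """Return one lm_eval command (as an argv list) per seed."""
    commands = []
    for seed in seeds:
        args = model_args
        if model == "vllm":
            # With vllm the seed is also passed through --model_args.
            args = f"{model_args},seed={seed}"
        commands.append([
            "lm_eval",
            "--model", model,
            "--model_args", args,
            "--tasks", f"score_non_greedy_robustness_{dataset}",
            f"--seed={seed}",
            "--log_samples",
            f"--output_path={output_root}/seed_{seed}",
        ])
    return commands


def build_summarizer_command(dataset, output_root):
    """Return the summarizer call for the parent folder of the seed_N runs."""
    return [
        "python", "./lm_eval/tasks/score/non_greedy_summarizer.py",
        "--log_dir", output_root,
        "--dataset", dataset,
    ]


for argv in build_seed_commands("mmlu_pro", "SOME_OUTPUT_PATH"):
    print(shlex.join(argv))
print(shlex.join(build_summarizer_command("mmlu_pro", "SOME_OUTPUT_PATH")))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each printed line can be pasted into a shell, or passed to `subprocess.run` in sequence.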


\* _As this evaluation requires `--log_samples` to be set, it needs some extra disk space to save the prediction results for each seed._

\*\* _Refer to [`./lm_eval/tasks/score/non_greedy.sh`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/non_greedy.sh) for an example of the non-greedy evaluation command for each seed._

\*\*\* _The `--log_dir` argument should be the path of the parent folder of the `"seed_1", "seed_2", ...` directories, which is not necessarily the `--output_path` passed to the evaluator in the 1st step._
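
Before calling the summarizer it can help to confirm that the folder given to `--log_dir` really contains all five `seed_N` directories. The check below is a hypothetical helper (`missing_seed_dirs` is not part of the PR) and assumes only the directory naming described in the footnote above.

```python
import tempfile
from pathlib import Path


def missing_seed_dirs(log_dir, seeds=range(1, 6)):
    """Return the names of the seed_N folders that are absent under log_dir."""
    root = Path(log_dir)
    return [f"seed_{s}" for s in seeds if not (root / f"seed_{s}").is_dir()]


# Usage on a throwaway directory that only holds the first two runs.
with tempfile.TemporaryDirectory() as tmp:
    for s in (1, 2):
        (Path(tmp) / f"seed_{s}").mkdir()
    print(missing_seed_dirs(tmp))  # ['seed_3', 'seed_4', 'seed_5']
```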
14 changes: 11 additions & 3 deletions lm_eval/tasks/score/README.md
@@ -31,7 +31,7 @@ limitations under the License.

## Tasks

Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 2 tasks:
Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 3 tasks:

* Option order robustness:
`score_option_order_robustness_mmlu_pro`,
@@ -41,10 +41,14 @@ Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the foll
`score_prompt_robustness_mmlu_pro`,
`score_prompt_robustness_agieval`,

Whereas math contains only
* Non greedy robustness:
`score_non_greedy_robustness_mmlu_pro`,
`score_non_greedy_robustness_agieval`,

Whereas math contains the following 2:
* Prompt robustness:
`score_prompt_robustness_math`

* Non greedy robustness:
`score_non_greedy_robustness_math`

### Option order robustness

@@ -55,6 +59,10 @@ Measures the model's robustness to the placement of the correct answer in the op
Measures the model's robustness to 10 different prompts. The list of prompts can be found in the `./prompt_templates.json` file under the key `prompt_robustness`.


### Non greedy robustness

Measures the model's robustness to 5 different seeds: seeds = \[1-5\]. To evaluate on the non greedy task, please refer to [NON_GREEDY.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/NON_GREEDY.md).

## Metrics

All robustness tasks calculate 2 metrics: *Accuracy* and *Consistency Rate(CR)* [[4](#cr)].
@@ -0,0 +1,36 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

task: non_greedy_robustness_agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: default
output_type: generate_until
test_split: test
process_docs: !function utils_agieval.non_greedy_robustness_process_docs
doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_gen_toks: 1024
  do_sample: true
  temperature: 0.7
  until: []
process_results: !function utils_agieval.non_greedy_robustness_process_results
metric_list:
  - metric: non_greedy_accuracy
    aggregation: !function utils_agieval.non_greedy_accuracy
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
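
To see why `do_sample: true` with `temperature: 0.7` makes the seed matter, here is a toy illustration of seeded temperature sampling over made-up logits. It is not how any particular backend implements generation; `sample_token` and the logit values are assumptions for demonstration only.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens
for seed in range(1, 6):
    rng = random.Random(seed)  # one independent generator per seed
    draws = [sample_token(logits, 0.7, rng) for _ in range(8)]
    print(seed, draws)
```

Re-running with the same seed reproduces the same draws, while different seeds can pick different tokens from the same distribution. That seed-to-seed variation is what the non-greedy tasks measure.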
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_logiqa_en
dataset_path: hails/agieval-logiqa-en
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_ar
dataset_path: hails/agieval-lsat-ar
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_en
dataset_path: hails/agieval-sat-en
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_agieval_aqua_rat.yaml
task: non_greedy_robustness_agieval_sat_math
dataset_path: hails/agieval-sat-math
6 changes: 5 additions & 1 deletion lm_eval/tasks/score/agi_eval/prompt_templates.json
@@ -1,9 +1,13 @@
{
"option_order_robustness":{
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion: {question}{options}\n\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"options_format": "\n{letter}: {option}"
},

"non_greedy_robustness":{
"prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
"options_format": "\n{letter}: {option}"
},

"prompt_robustness":[
{
@@ -0,0 +1,31 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

group: score_non_greedy_robustness_agieval
task:
- non_greedy_robustness_agieval_aqua_rat
- non_greedy_robustness_agieval_logiqa_en
- non_greedy_robustness_agieval_lsat_ar
- non_greedy_robustness_agieval_lsat_lr
- non_greedy_robustness_agieval_lsat_rc
- non_greedy_robustness_agieval_sat_en
- non_greedy_robustness_agieval_sat_math

aggregate_metric_list:
  - metric: non_greedy_accuracy
    aggregation: mean
    weight_by_size: true

metadata:
  version: 1.0
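
The group config above aggregates with `mean` and `weight_by_size: true`. Assuming this weights each subtask's accuracy by its number of samples (a micro average), the effect can be sketched as follows; the helper name and the subtask sizes are made up for illustration.

```python
def size_weighted_mean(scores, sizes):
    """Average per-subtask scores, weighting each by its sample count."""
    total = sum(sizes)
    return sum(score * n for score, n in zip(scores, sizes)) / total


scores = [0.5, 1.0]  # hypothetical per-subtask accuracies
sizes = [300, 100]   # hypothetical numbers of test documents

print(size_weighted_mean(scores, sizes))  # 0.625
print(sum(scores) / len(scores))          # 0.75
```

The weighted result equals the accuracy over the pooled samples, so large subtasks dominate, whereas the unweighted (macro) mean treats every subtask equally.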
@@ -16,5 +16,6 @@ group: score_robustness_agieval
task:
- score_prompt_robustness_agieval
- score_option_order_robustness_agieval
- score_non_greedy_robustness_agieval
metadata:
  version: 1.0
32 changes: 32 additions & 0 deletions lm_eval/tasks/score/agi_eval/utils_agieval.py
@@ -29,6 +29,7 @@

PROMPT_ROBUSTNESS_TEMPLATE_KEY = "prompt_robustness"
OPTION_ORDER_ROBUSTNESS_TEMPLATE_KEY = "option_order_robustness"
NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY = "non_greedy_robustness"

QUESTION_KEY = "query"
ANSWER_INDEX_KEY = "gold"
@@ -93,6 +94,13 @@ def __process(_doc, idx):
    dataset_specific_preprocess=initial_process_docs,
)

non_greedy_robustness_process_docs = partial(
    utils.non_greedy_robustness_process_docs,
    templates_key=NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY,
    template_file_path=TEMPLATE_FILE_PATH,
    dataset_specific_preprocess=initial_process_docs,
)


def prompt_robustness_process_results(doc, results) -> Dict[str, float]:
    final_answer = utils.__postprocess_pred(results[0])
@@ -135,6 +143,17 @@ def option_order_robustness_process_results(doc, results) -> Dict[str, float]:
    }


def non_greedy_robustness_process_results(doc, results) -> Dict[str, float]:
    # Post-process the raw generation, then map the answer onto the option labels.
    final_answer = utils.__postprocess_pred(results[0])
    final_answer = utils.translate_model_answer_to_labels(
        final_answer, option_format=doc["options_format"], labels=LABELS
    )
    question_id = doc["question_id"]
    gt = LABELS[doc["answer_index"]]

    # The trailing None fills the category slot, which the aggregation ignores.
    return {"non_greedy_accuracy": (question_id, final_answer, gt, None)}


def per_prompt_accuracy(results: List[Dict[str, Any]], p_id=0) -> float:
    accuracies = []
    for result in results:
@@ -181,3 +200,16 @@ def per_option_accuracy(results: List[Dict[str, Any]], always_opt="a") -> float:
per_option_accuracy_d = partial(per_option_accuracy, always_opt="D")

options_consistency_rate = partial(utils.options_consistency_rate, labels=LABELS)


def non_greedy_accuracy(results: List[Dict[str, Any]]) -> float:
    # Each result is the (question_id, final_answer, gt, category) tuple
    # emitted by non_greedy_robustness_process_results.
    accuracies = []
    for result in results:
        question_id, final_answer, gt, category = result

        accuracies.append(final_answer == gt)

    accuracy = sum(accuracies) / len(accuracies)
    eval_logger.info(f"Non greedy accuracy: {accuracy}")

    return np.round(accuracy, 4)
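
For readers who want to try the aggregation in isolation, the following is a standalone restatement of the same logic without `numpy` or the harness logger. The function name `non_greedy_accuracy_sketch` and the sample tuples are hypothetical; only the tuple layout `(question_id, final_answer, gt, category)` comes from the diff above.

```python
def non_greedy_accuracy_sketch(results):
    """Fraction of results whose parsed answer equals the ground truth."""
    correct = [final_answer == gt for _, final_answer, gt, _ in results]
    return round(sum(correct) / len(correct), 4)


# Hypothetical per-sample results for a single seed.
results = [
    ("q1", "A", "A", None),
    ("q2", "B", "C", None),  # the only wrong answer
    ("q3", "D", "D", None),
    ("q4", "E", "E", None),
]
print(non_greedy_accuracy_sketch(results))  # 0.75
```

Like the original, this sketch divides by `len(results)` and would raise `ZeroDivisionError` on an empty list.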
@@ -0,0 +1,36 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

task: non_greedy_robustness_math_algebra
dataset_path: EleutherAI/hendrycks_math
dataset_name: algebra
output_type: generate_until
test_split: test
process_docs: !function utils_math.non_greedy_robustness_process_docs
doc_to_text: !function utils_math.math_robustness_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_gen_toks: 1024
  do_sample: true
  temperature: 0.7
  until: []
process_results: !function utils_math.non_greedy_robustness_process_results
metric_list:
  - metric: non_greedy_accuracy
    aggregation: !function utils_math.non_greedy_accuracy
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
@@ -0,0 +1,17 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include: non_greedy_robustness_math_algebra.yaml
dataset_name: counting_and_probability
task: non_greedy_robustness_math_counting_and_prob