EleutherAI · rimashahbazyan · Oct 22, 2024 · Oct 22, 2024 · Oct 22, 2024 · Oct 29, 2024
@@ -0,0 +1,45 @@
+```
+Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+```
+# Non Greedy Evaluation
+
+This task checks the model's consistency under seed changes during generation.
+More specifically, it evaluates the model's accuracy and consistency rate with 5
+different seeds (seed = 1, 2, ..., 5) for a fixed prompt with temperature set to 0.7.
+
+## How to run the Non-Greedy evaluation of SCORE?
+
+Evaluating the non-greedy tasks differs slightly from the other SCORE tasks because the seed must be passed manually as an argument for each run. The step-by-step guide below explains how to run the **Score Non-Greedy** evaluation correctly.
+
+To run the Non-Greedy tasks with 5 different seeds:
+1. For a given dataset, run the evaluation by
+   * specifying the task as `score_non_greedy_robustness_{DATASET_NAME}` (`DATASET_NAME` being either `agieval`, `mmlu_pro` or `math`)
+   * fixing the seed with the run argument `--seed=1`
+   * passing the `--log_samples` argument\*
+   * specifying an output with `--output_path=SOME_OUTPUT_PATH/seed_1`
+   * if running with vLLM, also setting the seed in `--model_args` by specifying the `seed` parameter
+
+2. Repeat the process 5 times\*\*, changing the `--seed` and `--output_path` arguments accordingly from 1 to 5.
+
+3. When all 5 runs have finished and the logs are saved, run the `./lm_eval/tasks/score/non_greedy_summarizer.py` script, passing the output directory of the above runs to the `--log_dir` argument\*\*\* and the name of the evaluated dataset to the `--dataset` argument (`agieval`, `mmlu_pro` or `math`).
+
+4. The script prints the default lm_evaluation_harness table with the accuracy for each seed and the consistency rate.
+
+
+\* _As this evaluation requires `--log_samples` to be True, it will need some extra disk space to save the prediction results for each seed._
+
+\*\* _Refer to [`./lm_eval/tasks/score/non_greedy.sh`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/non_greedy.sh) for an example of the non-greedy evaluation command for each seed._
+
+\*\*\* _The `--log_dir` argument should be the path of the parent folder of the `seed_1`, `seed_2`, ... directories, which is not necessarily the `--output_path` passed to the evaluator in step 1._
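The per-seed runs described in NON_GREEDY.md can be sketched as a small command builder. This is only an illustration: `build_seed_commands` is a hypothetical helper, `MODEL_NAME` and `SOME_OUTPUT_PATH` are placeholders, and the script only prints the commands rather than running them. The authoritative example remains `non_greedy.sh`.

```python
import shlex


def build_seed_commands(dataset, output_root, model="hf",
                        model_args="pretrained=MODEL_NAME", seeds=range(1, 6)):
    """Build one lm_eval command (as an argument list) per seed."""
    commands = []
    for seed in seeds:
        args = model_args
        if model == "vllm":
            # Per the guide above, vLLM also needs the seed inside --model_args.
            args = f"{model_args},seed={seed}"
        commands.append([
            "lm_eval",
            "--model", model,
            "--model_args", args,
            "--tasks", f"score_non_greedy_robustness_{dataset}",
            f"--seed={seed}",
            "--log_samples",
            f"--output_path={output_root}/seed_{seed}",
        ])
    return commands


# Print the five commands for the agieval dataset without executing them.
for command in build_seed_commands("agieval", "SOME_OUTPUT_PATH"):
    print(shlex.join(command))
```

With this layout, `SOME_OUTPUT_PATH` (the parent of the `seed_*` folders) is the value that would be passed to `--log_dir` of the summarizer.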
@@ -31,7 +31,7 @@ limitations under the License.
 
 ## Tasks
 
-Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 2 tasks:
+Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 3 tasks:
 
 * Option order robustness:
 `score_option_order_robustness_mmlu_pro`,
@@ -41,10 +41,14 @@ Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the foll
 `score_prompt_robustness_mmlu_pro`,
 `score_prompt_robustness_agieval`,
 
-Whereas math contains only
+* Non greedy robustness:
+`score_non_greedy_robustness_mmlu_pro`,
+`score_non_greedy_robustness_agieval`,
+
+Whereas math contains the following 2:
 * Prompt robustness:
 `score_prompt_robustness_math`
-
+* Non greedy robustness: `score_non_greedy_robustness_math`
 
 ### Option order robustness
 
@@ -55,6 +59,10 @@ Measures the model's robustness to the placement of the correct answer in the op
 Measures the model's robustness to 10 different prompts. The list of prompts can be found in the `./prompt_templates.json` file under the key `prompt_robustness`.
 
 
+### Non greedy robustness
+
+Measures the model's robustness to 5 different seeds: seeds = \[1-5\]. To evaluate on the non greedy task, please refer to [NON_GREEDY.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/score/NON_GREEDY.md).
+
 ## Metrics
 
 All robustness tasks calculate 2 metrics: *Accuracy* and *Consistency Rate(CR)* [[4](#cr)].
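
For the non-greedy task, the two metrics can be illustrated on toy data. The sketch below assumes the consistency rate is the fraction of agreeing prediction pairs across seeds for each question; the function name and the sample predictions are hypothetical, and the summarizer script may compute it differently.

```python
from itertools import combinations


def consistency_rate(answers_per_question):
    """Fraction of agreeing prediction pairs, pooled over all questions."""
    agreeing = 0
    total = 0
    for answers in answers_per_question:
        # Compare every unordered pair of seed-wise predictions for one question.
        for first, second in combinations(answers, 2):
            agreeing += int(first == second)
            total += 1
    return agreeing / total if total else 0.0


# Hypothetical predictions for two questions, one entry per seed (1..5).
predictions = [
    ["A", "A", "A", "A", "A"],  # fully consistent: 10 of 10 pairs agree
    ["B", "C", "B", "B", "D"],  # three "B" answers: 3 of 10 pairs agree
]
print(consistency_rate(predictions))
```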

@@ -0,0 +1,36 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+task: non_greedy_robustness_agieval_aqua_rat
+dataset_path: hails/agieval-aqua-rat
+dataset_name: default
+output_type: generate_until
+test_split: test
+process_docs: !function utils_agieval.non_greedy_robustness_process_docs
+doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
+doc_to_target: answer
+generation_kwargs:
+  max_gen_toks: 1024
+  do_sample: true
+  temperature: 0.7
+  until: []
+process_results: !function utils_agieval.non_greedy_robustness_process_results
+metric_list:
+  - metric: non_greedy_accuracy
+    aggregation: !function utils_agieval.non_greedy_accuracy
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
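The `do_sample: true` and `temperature: 0.7` settings are what make the seed matter. The toy sampler below is a sketch of temperature sampling in general, not the harness or any model backend; `sample_token` and the logits are made up for illustration.

```python
import math
import random


def sample_token(logits, temperature, seed):
    """Sample one token index from softmax(logits / temperature)."""
    rng = random.Random(seed)
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [1.0, 3.0, 2.0]
# The same seed reproduces the same draw; different seeds may diverge.
for seed in range(1, 6):
    print(seed, sample_token(toy_logits, temperature=0.7, seed=seed))
```

As the temperature approaches 0 the distribution collapses onto the highest logit, which corresponds to greedy decoding and removes the dependence on the seed.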
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_logiqa_en
+dataset_path: hails/agieval-logiqa-en
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_lsat_rc
+dataset_path: hails/agieval-lsat-rc
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_lsat_ar
+dataset_path: hails/agieval-lsat-ar
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_lsat_lr
+dataset_path: hails/agieval-lsat-lr
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_sat_en
+dataset_path: hails/agieval-sat-en
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_agieval_aqua_rat.yaml
+task: non_greedy_robustness_agieval_sat_math
+dataset_path: hails/agieval-sat-math
@@ -1,9 +1,13 @@
 {
     "option_order_robustness":{
-        "prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion: {question}{options}\n\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
+        "prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
         "options_format": "\n{letter}: {option}"
     },
 
+    "non_greedy_robustness":{
+        "prompt": "For the multiple-choice question, which option (A-E) is correct?.\n\nQuestion:{question}{options}\nEnd the answer with the following:\nThe best answer is (the_answer_letter) where the (the_answer_letter) is one of 'A', 'B', 'C', 'D' or 'E'.",
+        "options_format": "\n{letter}: {option}"
+    },
 
     "prompt_robustness":[
             {
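
To see what the `non_greedy_robustness` template produces, the sketch below fills it with Python's `str.format`. How the harness itself fills the template is not shown in this diff, so `render_prompt` and the sample question are assumptions for illustration only.

```python
# Template strings copied from the "non_greedy_robustness" entry above.
PROMPT = (
    "For the multiple-choice question, which option (A-E) is correct?.\n\n"
    "Question:{question}{options}\nEnd the answer with the following:\n"
    "The best answer is (the_answer_letter) where the (the_answer_letter) "
    "is one of 'A', 'B', 'C', 'D' or 'E'."
)
OPTIONS_FORMAT = "\n{letter}: {option}"


def render_prompt(question, options, labels="ABCDE"):
    """Render each option with OPTIONS_FORMAT, then fill the prompt."""
    rendered = "".join(
        OPTIONS_FORMAT.format(letter=letter, option=option)
        for letter, option in zip(labels, options)
    )
    return PROMPT.format(question=question, options=rendered)


print(render_prompt(" What is 2 + 2?", ["3", "4", "5"]))
```

Because the template has no space after `Question:`, any separating whitespace has to come from the question text itself, as in the sample call.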

@@ -0,0 +1,31 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+group: score_non_greedy_robustness_agieval
+task:
+  - non_greedy_robustness_agieval_aqua_rat
+  - non_greedy_robustness_agieval_logiqa_en
+  - non_greedy_robustness_agieval_lsat_ar
+  - non_greedy_robustness_agieval_lsat_lr
+  - non_greedy_robustness_agieval_lsat_rc
+  - non_greedy_robustness_agieval_sat_en
+  - non_greedy_robustness_agieval_sat_math
+
+aggregate_metric_list:
+  - metric: non_greedy_accuracy
+    aggregation: mean
+    weight_by_size: true
+
+metadata:
+  version: 1.0
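With `aggregation: mean` and `weight_by_size: true`, the group score is a mean of subtask scores weighted by subtask size rather than a plain average. A minimal sketch of that arithmetic, using made-up scores and sizes (the helper name is hypothetical):

```python
def size_weighted_mean(scores, sizes):
    """Mean of per-subtask scores, weighted by the number of documents."""
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical subtask accuracies and document counts.
scores = [0.5, 0.8]
sizes = [100, 300]
print(size_weighted_mean(scores, sizes))  # weighted by size
print(sum(scores) / len(scores))          # unweighted, for comparison
```

Larger subtasks pull the group score towards their own accuracy, which matches pooling all documents before averaging.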
@@ -16,5 +16,6 @@ group: score_robustness_agieval
 task:
   - score_prompt_robustness_agieval
   - score_option_order_robustness_agieval
+  - score_non_greedy_robustness_agieval
 metadata:
   version: 1.0
@@ -29,6 +29,7 @@
 
 PROMPT_ROBUSTNESS_TEMPLATE_KEY = "prompt_robustness"
 OPTION_ORDER_ROBUSTNESS_TEMPLATE_KEY = "option_order_robustness"
+NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY = "non_greedy_robustness"
 
 QUESTION_KEY = "query"
 ANSWER_INDEX_KEY = "gold"
@@ -93,6 +94,13 @@ def __process(_doc, idx):
     dataset_specific_preprocess=initial_process_docs,
 )
 
+non_greedy_robustness_process_docs = partial(
+    utils.non_greedy_robustness_process_docs,
+    templates_key=NON_GREEDY_ROBUSTNESS_TEMPLATE_KEY,
+    template_file_path=TEMPLATE_FILE_PATH,
+    dataset_specific_preprocess=initial_process_docs,
+)
+
 
 def prompt_robustness_process_results(doc, results) -> Dict[str, float]:
     final_answer = utils.__postprocess_pred(results[0])
@@ -135,6 +143,17 @@ def option_order_robustness_process_results(doc, results) -> Dict[str, float]:
     }
 
 
+def non_greedy_robustness_process_results(doc, results) -> Dict[str, float]:
+    final_answer = utils.__postprocess_pred(results[0])
+    final_answer = utils.translate_model_answer_to_labels(
+        final_answer, option_format=doc["options_format"], labels=LABELS
+    )
+    question_id = doc["question_id"]
+    gt = LABELS[doc["answer_index"]]
+
+    return {"non_greedy_accuracy": (question_id, final_answer, gt, None)}
+
+
 def per_prompt_accuracy(results: List[Dict[str, Any]], p_id=0) -> float:
     accuracies = []
     for result in results:
@@ -181,3 +200,16 @@ def per_option_accuracy(results: List[Dict[str, Any]], always_opt="a") -> float:
 per_option_accuracy_d = partial(per_option_accuracy, always_opt="D")
 
 options_consistency_rate = partial(utils.options_consistency_rate, labels=LABELS)
+
+
+def non_greedy_accuracy(results: List[Dict[str, Any]]) -> float:
+    accuracies = []
+    for result in results:
+        # Each result is the (question_id, prediction, ground truth, category) tuple from process_results.
+        question_id, final_answer, gt, category = result
+        accuracies.append(final_answer == gt)
+
+    accuracy = sum(accuracies) / len(accuracies)
+    eval_logger.info(f"Non greedy accuracy: {accuracy}")
+
+    return np.round(accuracy, 4)
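The aggregation above consumes the tuples returned by `non_greedy_robustness_process_results`. The standalone sketch below mirrors that logic without `numpy` or `eval_logger` so it can be run in isolation; the sample results are hypothetical.

```python
def non_greedy_accuracy(results):
    """Share of results whose predicted label equals the ground truth."""
    correct = [
        final_answer == gt
        for _question_id, final_answer, gt, _category in results
    ]
    return round(sum(correct) / len(correct), 4)


# Hypothetical per-document results: (question_id, prediction, ground truth, category).
sample_results = [
    ("q1", "A", "A", None),
    ("q2", "C", "B", None),
    ("q3", "D", "D", None),
]
print(non_greedy_accuracy(sample_results))
```

Note that, like the original, this raises `ZeroDivisionError` on an empty result list.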
@@ -0,0 +1,36 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+task: non_greedy_robustness_math_algebra
+dataset_path: EleutherAI/hendrycks_math
+dataset_name: algebra
+output_type: generate_until
+test_split: test
+process_docs: !function utils_math.non_greedy_robustness_process_docs
+doc_to_text: !function utils_math.math_robustness_doc_to_text
+doc_to_target: answer
+generation_kwargs:
+  max_gen_toks: 1024
+  do_sample: true
+  temperature: 0.7
+  until: []
+process_results: !function utils_math.non_greedy_robustness_process_results
+metric_list:
+  - metric: non_greedy_accuracy
+    aggregation: !function utils_math.non_greedy_accuracy
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: non_greedy_robustness_math_algebra.yaml
+dataset_name: counting_and_probability
+task: non_greedy_robustness_math_counting_and_prob
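The per-subject configs rely on `include` to inherit everything from the base YAML and override only a few keys. The sketch below models that as a shallow dictionary merge, which is an assumption about how includes resolve; `resolve_include` is a hypothetical helper and only a subset of the base keys is shown.

```python
def resolve_include(base, child):
    """Return the base config with the child's keys layered on top."""
    merged = dict(base)
    # The "include" key only points at the base file, so it is dropped here.
    merged.update({key: value for key, value in child.items() if key != "include"})
    return merged


base_config = {
    "task": "non_greedy_robustness_math_algebra",
    "dataset_path": "EleutherAI/hendrycks_math",
    "dataset_name": "algebra",
    "output_type": "generate_until",
}
child_config = {
    "include": "non_greedy_robustness_math_algebra.yaml",
    "dataset_name": "counting_and_probability",
    "task": "non_greedy_robustness_math_counting_and_prob",
}
print(resolve_include(base_config, child_config))
```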