zero accuracy on mmlu_generative
#2279
Comments
I would look at the generations in the samples file, and also add some few-shot examples to the context. |
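If it helps, here is a minimal sketch of that debugging loop using the Python API; the argument and field names are based on my recollection of lm-eval's simple_evaluate and logged-sample format, so treat them as assumptions rather than the harness's exact interface:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1b",  # placeholder; any HF checkpoint
    tasks=["mmlu_generative"],
    num_fewshot=5,     # give the model in-context examples that end in a bare letter
    limit=10,          # small subset, debugging only
    log_samples=True,  # keep per-document generations next to the metrics
)
# Inspect raw vs. filtered generations to see what exact_match is actually comparing.
for task_name, samples in results["samples"].items():
    for s in samples[:2]:
        print(task_name, s.get("resps"), "->", s.get("filtered_resps"), "| target:", s.get("target"))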
I think it's pretty weird, and it may not be related to in-context learning. I also evaluated it and tested Qwen2-0.5B-Instruct on it. |
I'll take a look! My guess is a bug in the answer extraction |
Hello, I am having a similar result (0 for all subtasks) and I am wondering if you have figured it out? |
Hello, I also have this error while using the mmlu_generative task to benchmark the llama3 model.
Command:
python3 main.py \
--model hf \
--model_args pretrained=model-path \
--tasks mmlu_humanities_generative \
--limit 3 \
--output_path output/ \
--write_out
Result:
| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|----------------------------|------:|------|-----:|-----------|---|----:|---|-----:|
|formal_logic | 2|none | 0|exact_match|↑ | 0|± | 0|
|high_school_european_history| 2|none | 0|exact_match|↑ | 0|± | 0|
|high_school_us_history | 2|none | 0|exact_match|↑ | 0|± | 0|
|high_school_world_history | 2|none | 0|exact_match|↑ | 0|± | 0|
|international_law | 2|none | 0|exact_match|↑ | 0|± | 0|
|jurisprudence | 2|none | 0|exact_match|↑ | 0|± | 0|
|logical_fallacies | 2|none | 0|exact_match|↑ | 0|± | 0|
|moral_disputes | 2|none | 0|exact_match|↑ | 0|± | 0|
|moral_scenarios | 2|none | 0|exact_match|↑ | 0|± | 0|
|philosophy | 2|none | 0|exact_match|↑ | 0|± | 0|
|prehistory | 2|none | 0|exact_match|↑ | 0|± | 0|
|professional_law | 2|none | 0|exact_match|↑ | 0|± | 0|
|world_religions | 2|none | 0|exact_match|↑ | 0|± | 0|
I also tried to dump some intermediate results after adding some logging:
a) the prompt input text (added a print to the generate_until API in lm_eval/models/huggingface.py):
The following are multiple choice questions (with answers) about world religions.
Which of the following plays the most significant role in forming a child's political views?
A. The geographical area in which the child grows up
B. The child's family
C. The media to which the child is exposed
D. The child's religion
Answer:
b) LLM response from self._model_generate:
The child's religion
The response looks normal, but the value of exact_match in the final result table is always 0. Could you please help take a look? Thanks |
Hello,
I have been able to solve this. I only had to change the exact match function to this:
def exact_match(gold, pred=None):
    # Accept either (gold, pred) or a single [gold, pred] list.
    if pred is None and isinstance(gold, list):
        if len(gold) != 2:
            raise ValueError("If passing a single list argument, it must contain exactly two elements.")
        gold, pred = gold
    gold = str(gold).strip().upper()
    pred = str(pred).strip()
    if not pred:
        print("Warning: pred is empty")
        return 0.0
    # Compare only the first character of the prediction against the gold letter.
    pred_first_char = pred[0].upper()
    value = 1.0 if gold == pred_first_char else 0.0
    return value
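A quick sanity check of the function above, calling it directly outside the harness (the sample strings are invented for illustration):

print(exact_match("D", "D. The child's religion"))  # 1.0: pred's first character matches the gold letter
print(exact_match("D", "The child's religion"))     # 0.0: pred starts with "T", not "D"
print(exact_match(["B", "b) natural law"]))         # 1.0: the single [gold, pred] list form is also accepted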
and I used this exact_match in the task YAML here:
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: generate_until
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{['A', 'B', 'C', 'D'][answer]}}"
generation_kwargs:
  until:
    - "</s>"
    - "\n"
metric_list:
  - metric: !function utils.exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
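One assumption worth flagging (not stated above): as far as I understand lm-eval's custom-function mechanism, `!function utils.exact_match` is resolved relative to the task YAML, so the function is expected to live in a utils.py file in the same directory as this config.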
Let me know if you have any other questions.
Best
Aisha.
|
It is a bug in the extraction filtering. Take a look at the log of what it returns. If we modify the task and templates like this (files and changes), `_mmlu.yaml`:
group: mmlu_generative
group_alias: mmlu (generative)
task:
  - group: stem
    task:
      - mmlu_stem_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: other
    task:
      - mmlu_other_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: social sciences
    task:
      - mmlu_social_sciences_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: humanities
    task:
      - mmlu_humanities_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: True
    filter_list: get_response
metadata:
  version: 2
and the task-level config:
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: generate_until
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{['A', 'B', 'C', 'D'][answer]}}"
generation_kwargs:
  until:
    - "</s>"
    - "\n"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: get_response
    filter:
      # Filter everything after the first break line
      - function: "regex"
        regex_pattern: "^(.*?)(?=\\n|$)"
      # Remove leading white spaces
      - function: remove_whitespace
      # function to ignore right white spaces or line breaks
      - function: "regex"
        regex_pattern: "^(.*?)\\s*$"
      - function: take_first
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
we will get the expected result.
I tested this on Qwen2.5-32B-Instruct-AWQ (only 50 samples).
This is the same problem I observed in BBH; I'm planning on creating a PR later. Edit: Added 'take_first' to the filter. It changes nothing here (in terms of results), but it breaks exact match if multiple words are going to be matched. |
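For intuition, here is a rough standalone imitation of what the get_response filter chain above does to a single raw generation. The real lm-eval filters operate on lists of responses per document, and the sample strings below are invented:

import re

def get_response(raw: str) -> str:
    # 1) regex "^(.*?)(?=\n|$)": keep only the text before the first line break
    m = re.search(r"^(.*?)(?=\n|$)", raw)
    out = m.group(1) if m else raw
    # 2) remove_whitespace: drop leading whitespace
    out = out.lstrip()
    # 3) regex "^(.*?)\s*$": drop trailing whitespace or stray line breaks
    m = re.search(r"^(.*?)\s*$", out)
    out = m.group(1) if m else out
    return out

print(get_response(" D\nBecause the premise ..."))          # "D", so exact_match against "D" is 1
print(get_response("The child's religion\nIt shapes ..."))  # "The child's religion", so exact_match is 0

So the filter only cleans up the generation; the model still has to start its answer with the bare letter, which is what the few-shot conditioning discussed below provides.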
Hi RawthiL, we can see the output matches the content of option D, but exact_match is 0 since the response after the filter is not equal to "D". Do you have any experience with this kind of response in the filter? Thanks
|
It looks like you are doing zero-shot (presenting no examples prior to asking the question). This results in the model not being conditioned to respond with a letter (it gives an explicit answer instead), and hence the exact match fails. |
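For example (illustrative formatting only, not necessarily the harness's exact few-shot template), a few-shot prompt ends each in-context example with a bare letter such as "Answer: B" before posing the new question ending in "Answer:", so a single letter becomes the most likely continuation. Running with --num_fewshot 5 (or setting num_fewshot in the task YAML) is the usual way to get that conditioning.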
Got it, many thanks for your help. |
Opened a PR to fix this: |
Hi, thanks for providing such a wonderful evaluation toolkit.
I was wondering why evaluation on mmlu_generative returns 0 accuracy whatever models I try (pythia, qwen). I understand it as a generative version of mmlu: it can be used to evaluate base/instruct models by matching the model's output to a formatted target answer "{{['(A)', '(B)', '(C)', '(D)'][answer]}}".
My command:
Results: