Describe the bug
As far as I can tell, the MMLU-Pro task in LightEval prompts models incorrectly.
MMLU-Pro questions have up to 10 options (A through J), but the prompt only allows four letters (link):
Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
NOTE: It would also be worth checking that the format of the few-shot examples is consistent with the official harness. The pipeline is a bit messy to trace, but it looks to me like the few-shot formatting also differs from how the official harness does it.
To Reproduce
I don't have a nice MWE for this. I was benchmarking DeepSeek-V4-Flash via vLLM with both the official TIGER-AI-Lab harness and LightEval, and noticed a 15% drop in performance in the LightEval experiment when reasoning was disabled. Interestingly, enabling reasoning seems to let the model cope with the quirks of the LightEval implementation.
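Short of a full MWE, the mismatch itself can be illustrated without running any model. The snippet below is only a sketch: it works on the instruction text quoted above, and the helper names (`advertised_letters`, `uncovered_letters`) are hypothetical, not part of LightEval.

```python
import re
import string

# Instruction text as quoted in this issue.
PROMPT = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering."
)


def advertised_letters(prompt: str) -> str:
    """Return the answer letters the instruction tells the model to choose from."""
    match = re.search(r"LETTER is one of ([A-Z]+)", prompt)
    return match.group(1) if match else ""


def uncovered_letters(prompt: str, num_options: int) -> list[str]:
    """Return option letters a question needs that the instruction does not allow."""
    needed = string.ascii_uppercase[:num_options]
    allowed = advertised_letters(prompt)
    return [letter for letter in needed if letter not in allowed]


# A 10-option question needs A-J, but the instruction only allows A-D.
print(uncovered_letters(PROMPT, 10))  # ['E', 'F', 'G', 'H', 'I', 'J']
```

Any question whose gold answer falls in E-J contradicts the instruction the model was given.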
Expected behavior
The model should be given the same prompts in LightEval as in the original TIGER-AI-Lab harness. As it stands, the task is more difficult in LightEval.
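At a minimum, the letter range in the instruction could be derived from the number of options in each question. The following is a rough sketch of that idea, not the LightEval implementation and not the official harness prompt: `TEMPLATE` reuses the wording quoted above, and `build_instruction` is a hypothetical helper.

```python
import string

# Same wording as the instruction quoted in this issue, with the letter
# range turned into a placeholder.
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of {letters}. Think step by step before answering."
)


def build_instruction(options: list[str]) -> str:
    """Build the instruction with a letter range that matches the option count."""
    if not 1 <= len(options) <= len(string.ascii_uppercase):
        raise ValueError(f"unsupported number of options: {len(options)}")
    letters = string.ascii_uppercase[: len(options)]
    return TEMPLATE.format(letters=letters)


# A 10-option question now advertises A through J instead of ABCD.
instruction = build_instruction([f"option {i}" for i in range(10)])
print("ABCDEFGHIJ" in instruction)  # True
```

This only fixes the letter range. Matching the official harness exactly would also mean aligning the rest of the prompt and the few-shot formatting.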
Version info
I used LightEval v0.13.