Confusion about the description of evaluation settings
Could you please provide more details on evaluation settings of GSM8K dataset?We evaluate GSM8K CoT with chat template and 8-way few shot as multiturn.
- How do you implement CoT with only chat template?
- How do you compute exact match metric, via template parse?
- What's the difference between flexible and strict extract?
Hi, these are settings that are best understood in the context of the lm-eval harness, https://github.com/EleutherAI/lm-evaluation-harness
For example, you can replicate the GSM8k evaluations like so
lm_eval --model hf --model_args pretrained=tomg-group-umd/huginn-0125,trust_remote_code=True,dtype="bfloat16",mean_recurrence=64 \
--tasks gsm8k_cot --batch_size=1 --output_path=outputs/evals --log_samples --apply_chat_template=True \
--system_instruction="You are a helpful assistant that can assist users with mathematical reasoning." --fewshot_as_multiturn
Pick a model checkpoint you want for the pretrained argument, and a recurrence argument for mean_recurrence.
You can find the exact definition of flexible extract in the eval harness here: https://github.com/EleutherAI/lm-evaluation-harness/blob/52df63b7b30da53c481ed9090598d9189fab1d91/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L55
Thanks for your reply. I also noticed that there's a gsm8k_long_cot.yaml file in evaluate_raven, which is different from lm_eval's gsm8k-cot.yaml file. Is this file useful in reproducing the result?
By the way, I'd like to confirm whether "w/o sys. prompt" means --system_instruction=None in Table 2? And does the configuration for GSM8K and GSM8K CoT correspond to --task gsm8k_cot_zeroshot and --task gsm8k_cot --fewshot_as_multiturn, respectively?
Looking forward to your reply!
Ah no, that file is only useful if you wanted to use more than 8 fewshot examples, it's not used for evaluation in this work.
w/o system prompt:--apply_chat_template=False --system_instruction=None
w system prompt:--apply_chat_template=True --system_instruction="You are a helpful assistant that can assist users with mathematical reasoning." --fewshot_as_multiturn
The GSM8k column is the standard GSM8k setup, so --task gsm8k.
The GSM8k CoT column is --task gsm8k_cot. (EDIT: this one had a cmd too many)
Why is there a --fewshot_as_multiturn in w system prompt setting? So the GSM8k (which should be zero-shot) column with w system prompt row contains --fewshot_as_multiturn?
--fewshot_as_multiturn is a no-op if there are no fewshot examples, it only determines that if there are fewshot examples, they should be prepared as multiple messages, instead of being in a single message together with the actual query.
Thanks again for solving all my problems!