Smashed πŸ’ͺ Scored 82.86 πŸ”₯ 2-bit IQ2_M on MMLU Pro single-shot benchmark

by xbruce22

Earlier the same model scored 72.86. How did I improve it?
A few questions in the MMLU Pro bench took GLM 4.5 Air more than 15,000 tokens and about 25 minutes to answer.
So I increased the max output tokens to 32k and the API server timeout to 1 hour so that our bro has enough time to think 🀣
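For anyone reproducing this, here is a minimal sketch of where those two knobs sit when the quant is served behind an OpenAI-compatible endpoint (base URL, model name, and prompt are placeholders, not my exact evalscope setup):

```python
from openai import OpenAI

# Placeholder local OpenAI-compatible server (llama.cpp, vLLM, etc.) - adjust to your setup.
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="EMPTY",
    timeout=3600,  # 1 h client-side timeout so long reasoning chains aren't cut off
)

resp = client.chat.completions.create(
    model="GLM-4.5-Air-UD-IQ2_M",  # placeholder model id
    messages=[{"role": "user", "content": "an MMLU Pro style question"}],
    max_tokens=32768,              # raised to 32k so the model can finish thinking
)
print(resp.choices[0].message.content)
```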

Highly underrated model. Tool calling (the instruction-following kind) is also decent (better than gpt-oss 120B).

Logs:

+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                     | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+===========================+===========+=================+==================+=======+=========+=========+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | computer science |    10 |  0.8    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | math             |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | chemistry        |    10 |  0.8    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | engineering      |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | law              |    10 |  0.5    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | biology          |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | health           |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | physics          |    10 |  1      | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | business         |    10 |  0.8    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | philosophy       |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | economics        |    10 |  0.9    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | other            |    10 |  0.8    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | psychology       |    10 |  0.8    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | history          |    10 |  0.7    | default |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
| GLM-4.5-Air-UD-IQ2_M.gguf | mmlu_pro  | AverageAccuracy | OVERALL          |   140 |  0.8286 | -       |
+---------------------------+-----------+-----------------+------------------+-------+---------+---------+
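Sanity check: with 10 questions per subset, the OVERALL figure is just the mean of the 14 subset scores:

```python
# Per-subset accuracies from the table above (10 questions each)
scores = [0.8, 0.9, 0.8, 0.9, 0.5, 0.9, 0.9, 1.0, 0.8, 0.9, 0.9, 0.8, 0.8, 0.7]
print(f"{sum(scores) / len(scores):.4f}")  # 0.8286 -> 82.86%
```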

Hello, I would like to know which would be better: the IQ2_KL model at ubergarm/GLM-4.5-Air-GGUF, or the IQ2_M and Q2_K_XL models here. Thank you.

I used Unsloth's IQ2_M GGUF (size: 44.3 GB).

Can you kindly tell us your sampling parameters? Temperature, etc.

Temp: 0.0
Seed: 42
Max tokens: 32k

That's it. I used evalscope, so you can check out what they are using.
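For reference, a service-mode evalscope run against an OpenAI-compatible endpoint looks roughly like the sketch below. The field names are from memory of evalscope's service-evaluation examples, so treat them as assumptions and check the evalscope docs for the exact schema:

```python
# Rough sketch only - field names are assumptions; verify against the evalscope documentation.
from evalscope import TaskConfig, run_task

task = TaskConfig(
    model="GLM-4.5-Air-UD-IQ2_M",  # placeholder model id as exposed by the server
    api_url="http://127.0.0.1:8080/v1/chat/completions",
    api_key="EMPTY",
    eval_type="service",           # evaluate a model served behind an API
    datasets=["mmlu_pro"],
    limit=10,                      # 10 questions per subset, as in the table above
    generation_config={
        "temperature": 0.0,
        "seed": 42,
        "max_tokens": 32768,
    },
)
run_task(task_cfg=task)
```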

That's greedy decoding. Do you, or does anyone else, know the sampling parameters recommended by z.ai for general usage?
