Model Chat Template not working correctly?
Right now I'm starting it with:
llama-server -m models/Magistral-Small-2509-UD-Q6_K_XL.gguf \
--jinja \
--host 0.0.0.0 \
--port 8181 \
-ngl 99 \
-c 16384 \
-fa \
--temp 0.7 \
--top-k -1 \
--top-p 0.95 \
--mmproj models/Magistral-Small-2509-mmproj-BF16.gguf
In OpenWebUI I tested it with the prompt What is 2 * 2^3, and the output I got was:
The question is asking for the result of the multiplication of 2 and 2 raised to the power of 3.
First, let's recall the order of operations (PEMDAS/BODMAS), which states that exponentiation comes before multiplication. So, we first need to calculate \(2^3\).
Now, \(2^3\) means 2 multiplied by itself three times:
\[ 2^3 = 2 \times 2 \times 2 = 8 \]
Now that we have the result of the exponentiation, we can perform the multiplication:
\[ 2 \times 8 = 16 \]
So, the final answer is 16.The solution to the expression \(2 \times 2^3\) can be found by following the order of operations. First, we evaluate the exponentiation:
\[ 2^3 = 2 \times 2 \times 2 = 8 \]
Next, we perform the multiplication:
\[ 2 \times 8 = 16 \]
Thus, the final answer is:
\[
\boxed{16}
\]
Am I using it incorrectly, or does this model not use opening/closing think tags? (Also, if it helps, I last updated llama.cpp around 3 weeks ago; should I rebuild it?)
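For reference, in case anyone wants to reproduce this without OpenWebUI, the request that ends up hitting the server is roughly this (a sketch against llama-server's OpenAI-compatible endpoint, same port as in my command above):
curl http://localhost:8181/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [{"role": "user", "content": "What is 2 * 2^3"}],
  "temperature": 0.7,
  "top_p": 0.95
}'
# If the [THINK]/[/THINK] tags were being emitted, they would show up in choices[0].message.content here
# (or, depending on how llama-server parses the template, in a separate reasoning_content field).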
Running llama-server build b6527 and OpenWebUI v0.6.30, I can't make it work either: the reasoning content ends up not being enclosed in [THINK][/THINK] tags.
I tried setting custom reasoning tags (which, for example, work as intended for the Seed-OSS model, which also has exotic tags not natively detected by OpenWebUI). But here I don't even see the tags in the output in the first place, so I guess OpenWebUI can't detect them even if we set them as custom ones.
I use the GGUF template Unsloth provides, via --jinja.
My llama-server command is approximately the same as @qingy2024's:
/home/user/llama.cpp/build/bin/llama-server \
--model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/Magistral-Small-2509-UD-Q4_K_XL/Magistral-Small-2509-UD-Q4_K_XL.gguf \
--ctx-size 16000 \
--no-context-shift \
--n-gpu-layers 41 \
--temp 0.7 \
--top-p 0.95 \
--repeat-penalty 1 \
--jinja \
--host 0.0.0.0 \
--port ${PORT}
Only diff is that I tried disabling flash-attn and I'm not loading the mmproj. What's your magic trick @ayylmaonade :D
Getting the same using llama-cli: the tags are not there.
./llama-cli -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/Magistral-Small-2509-UD-Q4_K_XL/Magistral-Small-2509-UD-Q4_K_XL.gguf --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
> How many MB per second is that if you have a 2.5 gigabit per second link?
Okay, the question is about converting a data transfer rate from gigabits per second to megabytes per second. Let's start by recalling the basic units:
- 1 gigabit (Gb) = 10^9 bits
- 1 megabyte (MB) = 10^6 bytes
- 1 byte = 8 bits
[...]
I also have no thinking tags, even using the recommended system prompt.
llama-server \
-m /root/.cache/llama.cpp/unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q8_K_XL.gguf \
--n-gpu-layers 99 \
--threads 32 \
--threads-batch 32 \
--jinja \
--no-mmap \
-fa on \
--temp 0.7 \
--top-k -1 \
--top-p 0.95 \
-c 40000 \
-n 6144 \
--cache-reuse 256 \
--port 6666 \
--host 0.0.0.0 \
--metrics
@danielhanchen
@shimmyshimmer
Just pinging since I haven't been able to figure out how to make the model's thinking tags work correctly...
Hey, so you need to add --special to see [THINK] [/THINK] pop up - since they're special tokens, no output is provided for them otherwise.
Thanks for the quick answer! I added --special to the llama-server command and filled in the reasoning tags like @ayylmaonade said, and it's working perfectly now.
Thanks guys!
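For anyone else hitting this, the change really is just appending --special to the server command; a trimmed sketch of what I'm running now (same model, port and remaining flags as in my full command above):
# --special makes llama-server output special tokens, so [THINK]/[/THINK] actually show up in the stream
llama-server \
-m /root/.cache/llama.cpp/unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q8_K_XL.gguf \
--jinja \
--special \
--temp 0.7 --top-k -1 --top-p 0.95 \
--host 0.0.0.0 --port 6666
# (all other flags from the full command above stay the same)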
Confirming the trailing tag, which has to be added as a stop sequence.
But yo, I just had a look at the default system prompt Mistral provides... It's a mess. How can they put so much effort into training a model, and then right away feed it garbage...
Sorry, but seriously: redundancy, grammar, formatting, confusion between answer and response... And I'm not even a native English speaker...
Here is the original one:
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.
Here is the revised version I use:
First share your thinking process (inner monologue), then provide your answer. Take as long as you need until you are confident enough to answer.\nFormat your answer using Markdown, and use LaTeX for any mathematical equations.\nWrite both your thoughts and your answer in the same language as the input.\nYour response must respect the following template: `[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper[/THINK]Your answer`
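For anyone who'd rather not touch the chat template at all, the system prompt can also be sent per request through llama-server's OpenAI-compatible endpoint; a rough sketch (prompt shortened, port as in the first command of this thread):
curl http://localhost:8181/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "system", "content": "First share your thinking process (inner monologue), then provide your answer. [...] Your response must respect the following template: [THINK]Your thoughts or/and draft[/THINK]Your answer"},
    {"role": "user", "content": "What is 2 * 2^3"}
  ],
  "temperature": 0.7,
  "top_p": 0.95
}'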
I guess reasoning models are less affected by the quality of the system prompt, getting on their own rails quickly... But still...
What is even more horrific is to imagine they might have trained it with this prompt. I mean, is this real?
EDIT: replaced "follow the following" with "respect the following", better like this!
You make a lot of assumptions about the system prompt. Replacing their '\n\n' with '\n' is a bad idea. It's pretty much a universal constant across all models, everywhere, ever, that single newlines should be avoided because of how tokenization operates. As for how it's phrased, while I don't disagree, it really doesn't matter as long as their version was the one used during training.
Replacing their '\n\n' with '\n' is a bad idea. It's pretty much a universal constant across all models, everywhere, ever, that single newlines should be avoided because of how tokenization operates.
How would the fact that it's tokenized as, say, 3 tokens instead of 2 matter? Plus, for tough cases, when you refine your prompts based on the first passes, the more you enrich them with details and so on, the more you end up using some formatting, and as good formatting as possible, right? Just exploiting the mirror effect. Also, I personally don't restrict myself to words that encode to a single token.
I really don't see the difference here.
On top of that, I'd guess \n and \n\n are close enough in latent space that the paths taken won't differ in any noticeable way.
But in the end, my judgement that it reads better presented like this is purely a feeling. The suggestion was focused on the rest, mainly the redundancies.
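If anyone wants to check the actual token counts, llama-server exposes a /tokenize endpoint, so it's easy to compare; a quick sketch (pointing at whichever port your server runs on, 8181 here as in the first command):
# single newline between two instructions
curl http://localhost:8181/tokenize -H "Content-Type: application/json" -d '{"content": "Do this first.\nThen do that."}'
# double newline between the same two instructions
curl http://localhost:8181/tokenize -H "Content-Type: application/json" -d '{"content": "Do this first.\n\nThen do that."}'
# each call returns {"tokens": [...]}; comparing the lengths shows exactly how \n vs \n\n changes the count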
it really doesn't matter as long as their version was the one used during training
Do you have any sources on this topic? It's something I often think about, but I haven't found anything on it yet :(
I know some say it could affect accuracy (Mistral included!), but I would love to have benchmark results on this!
No, I get that you won't run into an outright tokenization failure just by using a single newline character, obviously. It's more about how models process information. \n (alone) in training data is generally for lists. \n\n is a separate, clear indication of: do this sentence, do that second sentence, then that third one. For instance, look at Claude's system prompt: all instructions are separated by two line breaks. You're not wasting space by using another newline, you're just reinforcing the weight and importance of each individual line more than you would with a single delimiter.
Overall, does that really matter for 99% of the use cases? No, not really. LLMs are good at working with sub-optimal prompts they weren't exactly trained on (that's literally their job), but pretending your prompt is better because you removed the paragraph breaks is misleading. Note, I don't disagree that their prompt is cancer to read, and suboptimal too. And as a French person, watching a French company fail at writing two sentences' worth of very basic English bugs me more than you can imagine.
Edit: sorry, I somehow managed to ignore the second part.
Do you have any sources on this topic? It's something I often think about, but I haven't found anything on it yet :( I know some say it could affect accuracy (Mistral included!), but I would love to have benchmark results on this!
I don't have a link to give you, no. But it's not really that complicated. If a model was trained on billions of examples where the system prompt was exactly X every single time, and you change the system prompt to Y at inference, then yeah, the result is obviously going to be slightly different, and probably worse. That's not really model dependent; it's just how training works. In practice, for a solid model, you'd want to train it with a variety of system prompts, especially task-specific ones, to give it more breadth. But, knowing Mistral, it's unlikely they did that, hence their very strict specifications on system prompts and inference sampling settings in general.


