Model Chat Template not working correctly?
Right now I'm starting it with:
llama-server -m models/Magistral-Small-2509-UD-Q6_K_XL.gguf \
--jinja \
--host 0.0.0.0 \
--port 8181 \
-ngl 99 \
-c 16384 \
-fa \
--temp 0.7 \
--top-k -1 \
--top-p 0.95 \
--mmproj models/Magistral-Small-2509-mmproj-BF16.gguf
In OpenWebUI I tested it with the prompt What is 2 * 2^3, and the output I got was:
The question is asking for the result of the multiplication of 2 and 2 raised to the power of 3.
First, let's recall the order of operations (PEMDAS/BODMAS), which states that exponentiation comes before multiplication. So, we first need to calculate \(2^3\).
Now, \(2^3\) means 2 multiplied by itself three times:
\[ 2^3 = 2 \times 2 \times 2 = 8 \]
Now that we have the result of the exponentiation, we can perform the multiplication:
\[ 2 \times 8 = 16 \]
So, the final answer is 16.The solution to the expression \(2 \times 2^3\) can be found by following the order of operations. First, we evaluate the exponentiation:
\[ 2^3 = 2 \times 2 \times 2 = 8 \]
Next, we perform the multiplication:
\[ 2 \times 8 = 16 \]
Thus, the final answer is:
\[
\boxed{16}
\]
Am I using it incorrectly, or does this model not use opening/closing think tags? (Also, if it helps, I last updated llama.cpp around 3 weeks ago; should I rebuild it?)
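For reference, in case anyone wants to reproduce this without OpenWebUI, the request that ends up hitting the server is roughly this (a sketch against llama-server's OpenAI-compatible endpoint, same port as in my command above):
curl http://localhost:8181/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [{"role": "user", "content": "What is 2 * 2^3"}],
  "temperature": 0.7,
  "top_p": 0.95
}'
# If the [THINK]/[/THINK] tags were being emitted, they would show up in choices[0].message.content here
# (or, depending on how llama-server parses the template, in a separate reasoning_content field).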
Running llama-server build b6527 and OpenWebUI v0.6.30, I can't make it work either: the reasoning content ends up not being enclosed in [THINK][/THINK] tags.
I tried setting custom reasoning tags (which, for example, work as intended for the Seed-OSS model, which also has exotic tags not natively detected by OpenWebUI). But here I don't even see the tags in the output in the first place, so I guess OpenWebUI can't detect them even if we set them as custom ones.
I use the GGUF template Unsloth provides, via --jinja.
My llama-server command is approximately the same as @qingy2024's:
/home/user/llama.cpp/build/bin/llama-server \
--model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/Magistral-Small-2509-UD-Q4_K_XL/Magistral-Small-2509-UD-Q4_K_XL.gguf \
--ctx-size 16000 \
--no-context-shift \
--n-gpu-layers 41 \
--temp 0.7 \
--top-p 0.95 \
--repeat-penalty 1 \
--jinja \
--host 0.0.0.0 \
--port ${PORT}
Only diff is that I tried disabling flash-attn and I'm not loading the mmproj. What's your magic trick @ayylmaonade :D
Getting the same using llama-cli: the tags are not there.
./llama-cli -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/Magistral-Small-2509-UD-Q4_K_XL/Magistral-Small-2509-UD-Q4_K_XL.gguf --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
> How many MB per second is that if you have a 2.5 gigabit per second link?
Okay, the question is about converting a data transfer rate from gigabits per second to megabytes per second. Let's start by recalling the basic units:
- 1 gigabit (Gb) = 10^9 bits
- 1 megabyte (MB) = 10^6 bytes
- 1 byte = 8 bits
[...]
I also have no thinking tags, even using the recommended system prompt.
llama-server \
-m /root/.cache/llama.cpp/unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q8_K_XL.gguf \
--n-gpu-layers 99 \
--threads 32 \
--threads-batch 32 \
--jinja \
--no-mmap \
-fa on \
--temp 0.7 \
--top-k -1 \
--top-p 0.95 \
-c 40000 \
-n 6144 \
--cache-reuse 256 \
--port 6666 \
--host 0.0.0.0 \
--metrics
@danielhanchen
@shimmyshimmer
Just pinging since I haven't been able to figure out how to make the model's thinking tags work correctly...
Hey, so you need to add --special to see [THINK] [/THINK] pop up - since they're special tokens, no output is provided for them otherwise.
Thanks for the quick answer! I added --special to the llama-server command and filled in the reasoning tags like @ayylmaonade said, and it's working perfectly now.
Thanks guys!
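For anyone else hitting this, the change really is just appending --special to the server command; a trimmed sketch of what I'm running now (same model, port and remaining flags as in my full command above):
# --special makes llama-server output special tokens, so [THINK]/[/THINK] actually show up in the stream
llama-server \
-m /root/.cache/llama.cpp/unsloth_Magistral-Small-2509-GGUF_Magistral-Small-2509-UD-Q8_K_XL.gguf \
--jinja \
--special \
--temp 0.7 --top-k -1 --top-p 0.95 \
--host 0.0.0.0 --port 6666
# (all other flags from the full command above stay the same)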
Confirming the trailing tag, which has to be added as a stop sequence.
But yo, I just had a look at the default system prompt Mistral provides... It's a mess. How can they put so much effort into training a model, and then right away feed it garbage...
Sorry, but seriously: redundancy, grammar, formatting, confusion between answer and response... And I'm not even a native English speaker...
Here is the original one:
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response. Use the same language as the input.[/THINK]Here, provide a self-contained response.
Here is the revised version I use:
First share your thinking process (inner monologue), then provide your answer. Take as long as you need until you are confident enough to answer.\nFormat your answer using Markdown, and use LaTeX for any mathematical equations.\nWrite both your thoughts and your answer in the same language as the input.\nYour response must respect the following template: `[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper[/THINK]Your answer`
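For anyone who'd rather not touch the chat template at all, the system prompt can also be sent per request through llama-server's OpenAI-compatible endpoint; a rough sketch (prompt shortened, port as in the first command of this thread):
curl http://localhost:8181/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "system", "content": "First share your thinking process (inner monologue), then provide your answer. [...] Your response must respect the following template: [THINK]Your thoughts or/and draft[/THINK]Your answer"},
    {"role": "user", "content": "What is 2 * 2^3"}
  ],
  "temperature": 0.7,
  "top_p": 0.95
}'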
I guess reasoning models are less affected by the quality of the system prompt, getting on their own rails quickly... But still...
What is even more horrific is to imagine they might have trained it with this prompt. I mean, is this real?
EDIT: replaced "follow the following" with "respect the following", better like this!
You make a lot of assumptions about the system prompt. Replacing their '\n\n' with '\n' is a bad idea. It's pretty much a universal constant across all models, everywhere, ever, that single newlines should be avoided because of how tokenization operates. As for how it's phrased, while I don't disagree, it really doesn't matter as long as their version was the one used during training.
Replacing their '\n\n' with '\n' is a bad idea. It's pretty much a universal constant across all models, everywhere, ever, that single newlines should be avoided because of how tokenization operates.
How would the fact that it's tokenized as, say, 3 tokens instead of 2 matter? Plus, for tough cases, when you refine your prompts based on the first passes, the more you enrich them with details and so on, the more you end up using some formatting, and as good formatting as possible, right? Just exploiting the mirror effect. Also, I personally don't restrict myself to words that encode to a single token.
I really don't see the difference here.
On top of that, I'd guess \n and \n\n are close enough in latent space that the paths taken won't differ in any noticeable way.
But in the end, my judgement that it reads better presented like this is purely a feeling. The suggestion was focused on the rest, mainly the redundancies.
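If anyone wants to check the actual token counts, llama-server exposes a /tokenize endpoint, so it's easy to compare; a quick sketch (pointing at whichever port your server runs on, 8181 here as in the first command):
# single newline between two instructions
curl http://localhost:8181/tokenize -H "Content-Type: application/json" -d '{"content": "Do this first.\nThen do that."}'
# double newline between the same two instructions
curl http://localhost:8181/tokenize -H "Content-Type: application/json" -d '{"content": "Do this first.\n\nThen do that."}'
# each call returns {"tokens": [...]}; comparing the lengths shows exactly how \n vs \n\n changes the count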
it really doesn't matter as long as their version was the one used during training
Do you have any sources on this topic? It's something I often think about, but I haven't found anything on it yet :(
I know some say it could affect accuracy (Mistral included!), but I would love to have benchmark results on this!
No, I get that you won't run into an outright tokenization failure just by using a single newline character, obviously. It's more about how models process information. \n (alone) in training data is generally for lists. \n\n is a separate, clear indication of: do this sentence, do that second sentence, then that third one. For instance, look at Claude's system prompt: all instructions are separated by two line breaks. You're not wasting space by using another newline, you're just reinforcing the weight and importance of each individual line more than you would with a single delimiter.
Overall, does that really matter for 99% of the use cases? No, not really. LLMs are good at working with sub-optimal prompts they weren't exactly trained on (that's literally their job), but pretending your prompt is better because you removed the paragraph breaks is misleading. Note, I don't disagree that their prompt is cancer to read, and suboptimal too. And as a French person, watching a French company fail at writing two sentences' worth of very basic English bugs me more than you can imagine.
Edit: sorry, I somehow managed to ignore the second part.
Do you have any sources on this topic? It's something I often think about, but I haven't found anything on it yet :( I know some say it could affect accuracy (Mistral included!), but I would love to have benchmark results on this!
I don't have a link to give you, no. But it's not really that complicated. If a model was trained on billions of examples where the system prompt was exactly X every single time, and you change the system prompt to Y at inference, then yeah, the result is obviously going to be slightly different, and probably worse. That's not really model dependent; it's just how training works. In practice, for a solid model, you'd want to train it with a variety of system prompts, especially task-specific ones, to give it more breadth. But, knowing Mistral, it's unlikely they did that, hence their very strict specifications on system prompts and inference sampling settings in general.


