First <think> token not getting output
Can you help me understand why the first <think> token is not being output?
Below is my run command:
CUDA_VISIBLE_DEVICES="2" ./build/bin/llama-server \
--model /media/mukul/t7/models/unsloth/MiniMax-M2-GGUF/UD-Q4_K_XL/MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf \
--alias unsloth/MiniMax-M2 \
--ctx-size 98304 \
-fa on \
-b 4096 -ub 4096 \
-ot ".ffn_.*_exps.=CPU" \
--n-gpu-layers 99 \
--jinja \
--parallel 1 \
--threads 56 \
--host 0.0.0.0 \
--port 10002
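For reference, here is roughly how I am checking the output (just a sketch, assuming the server above is up on port 10002); the reply never starts with the <think> token:
# sketch: hit the OpenAI-compatible endpoint of the running llama-server
curl -s http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "unsloth/MiniMax-M2", "messages": [{"role": "user", "content": "Hello"}]}'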
Add --special and it should be outputted!
That did not work for me. I added --special; am I doing it wrong?
CUDA_VISIBLE_DEVICES="2" ./build/bin/llama-server \
--model /media/mukul/t7/models/unsloth/MiniMax-M2-GGUF/UD-Q4_K_XL/MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf \
--alias unsloth/MiniMax-M2 \
--ctx-size 98304 \
-fa on \
-b 4096 -ub 4096 \
-ot ".ffn_.*_exps.=CPU" \
--n-gpu-layers 99 \
--jinja \
--special \
--parallel 1 \
--threads 56 \
--host 0.0.0.0 \
--port 10002
I do not know how to do that. Where is the file that I need to edit? Is it in the checked-out git clone of the repo?
1. Download the default chat template from https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/chat_template.jinja (see the sketch below this list)
2. Fix it (remove the <think> that the template appends by default, as discussed further down)
3. Run the model with this command:
./llama-server --model ../MiniMax-M2-UD-TQ1_0.gguf --alias "minimax" --threads -1 --n-gpu-layers 999 --prio 3 --temp 1.0 --top-p 0.95 --top-k 40 --ctx-size 60000 --port 8001 --host 0.0.0.0 --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 -b 4000 -ub 1024 --chat-template-file ./chat_template.jinja --jinja
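In case anyone else is unsure what steps 1 and 2 look like in practice, here is a rough sketch (assumptions: the standard Hugging Face /resolve/main/ raw-file URL works for this repo, and the trailing <think> appended after the assistant header is what "Fix it" refers to, so double-check the file yourself):
# sketch: download the raw template file
curl -L -o chat_template.jinja \
  https://huggingface.co/MiniMaxAI/MiniMax-M2/resolve/main/chat_template.jinja
# then open chat_template.jinja and remove the <think> that the template appends
# after the assistant turn, before passing the file via --chat-template-file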
Thank you! That indeed fixed it!
I appreciate all the help!
Oh yes, after doing some investigation it seems the MiniMax chat template includes the think token by default, so you will not be seeing it in the output.
It works with the fix above.
IMO, the <think> in the template is meant to ensure the model will output thinking content. Without it, the model probably still generates <think> at the beginning, but it's not guaranteed.
But if you keep the <think> in the template, it breaks most clients that expect <think> to be output at the start of thinking, and the thinking content ends up being returned as the actual response.
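One way to check which of these behaviours you are getting is to look at the template the server actually loaded, e.g. something like this (a sketch; it assumes your llama.cpp build exposes a chat_template field on the /props endpoint and that python3 is available, and you should adjust the port to your server):
# sketch: print the chat template the running llama-server loaded
curl -s http://localhost:10002/props | \
  python3 -c "import sys, json; print(json.load(sys.stdin).get('chat_template', ''))"
If the printed template still appends <think> after the assistant turn, you are in the "keep it in the template" case described above.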

