llama.cpp missing `<think>`
I'm lost here. I was having an issue where the `<think>` token wasn't prepended to the LM output.
Digging a bit, I saw there are no more references to `add_generation_prompt` in https://github.com/ggml-org/llama.cpp/tree/master/tools/server
I was confused, so I tried removing the condition:
Before:
```jinja
{%- if add_generation_prompt -%}
    {%- if thinking -%}
        {{- "<|im_start|>assistant\n<think>\n" -}}
    {%- else -%}
        {{- "<|im_start|>assistant\n" -}}
    {%- endif -%}
{%- endif -%}
```
After:
```jinja
{%- if thinking -%}
    {{- "<|im_start|>assistant\n<think>\n" -}}
{%- else -%}
    {{- "<|im_start|>assistant\n" -}}
{%- endif -%}
```
But then, even when I started my llama.cpp server with `--chat-template-kwargs '{"thinking": true}'`, I still didn't get the expected behavior.
So, in despair, I tried removing the `{%- if thinking -%}` condition too, ending up with only:
```jinja
{{- "<|im_start|>assistant\n<think>\n" -}}
```
And... it worked! I now always get the `<think>` tag prepended, and therefore always get the reasoning behavior.
But why? That's a mystery to me!
If anyone has an idea of what's going on there I'd be happy to know more.
By the way, thanks to @Xenova for their https://huggingface.co/spaces/Xenova/jinja-playground!
Same here too:
https://huggingface.co/aquif-ai/aquif-3.5-Max-42B-A3B/discussions/6#690da8dc15b4cbbdff9d882d
I made a template to work with LM Studio, but that too is not working properly.
@gopi87
Before we get help from someone else, you can use it with the changes above. I don't know whether LM Studio uses its own Jinja parser or the one from llama.cpp (which I read is minja rather than full Jinja, though I didn't check whether that's still the case, and I don't even know if the problem comes from there), but it at least works 100% of the time with llama.cpp!
Tool calls are a problem, though. I didn't try extensively yesterday, but from my early attempts it failed with the standard tool-calling format (like OpenCode or Crush, if I'm right) but worked flawlessly with the custom RooCode one, for example.
But at least I can make some use of the model now :)
Edit: not working well at all with RooCode; it tends to do too many steps in one tool call all the time. I didn't try it enough for regular use, but for the agentic case it doesn't seem so great :/
Edit 2: the last time I tried with RooCode, it did the exact opposite, editing files one line at a time, like 20 tool calls for fewer than 30 edited lines! I don't know what to think!
I noticed `srv init: thinking = 0` in the llama.cpp server output, and then I found this:
```cpp
// thinking is enabled if:
// 1. It's not explicitly disabled (reasoning_budget == 0)
// 2. The chat template supports it
const bool enable_thinking = params_base.use_jinja && params_base.reasoning_budget != 0 && common_chat_templates_support_enable_thinking(chat_templates.get());
SRV_INF("thinking = %d\n", enable_thinking);
```
```cpp
bool common_chat_templates_support_enable_thinking(const common_chat_templates * chat_templates) {
    common_chat_templates_inputs dummy_inputs;
    common_chat_msg msg;
    msg.role = "user";
    msg.content = "test";
    dummy_inputs.messages = {msg};
    dummy_inputs.enable_thinking = false;
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    return rendered_no_thinking.prompt != rendered_with_thinking.prompt;
}
```
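So the server probes the template: it renders it twice, once with `enable_thinking = false` and once with `enable_thinking = true`, and only reports support if the two rendered prompts differ. A minimal sketch of a fragment that would make the probe pass (the variable name is the important part, as the rendering code below shows):
```jinja
{#- reads enable_thinking, so the two probe renders differ -#}
{%- if enable_thinking -%}{{- "<think>\n" -}}{%- endif -%}
```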
...and further down, in `apply()`, where the template actually gets rendered:
```cpp
static std::string apply(
    const common_chat_template & tmpl,
    const struct templates_params & inputs,
    const std::optional<json> & messages_override = std::nullopt,
    const std::optional<json> & tools_override = std::nullopt,
    const std::optional<json> & additional_context = std::nullopt)
{
    minja::chat_template_inputs tmpl_inputs;
    tmpl_inputs.messages = messages_override ? *messages_override : inputs.messages;
    if (tools_override) {
        tmpl_inputs.tools = *tools_override;
    } else {
        tmpl_inputs.tools = inputs.tools.empty() ? json() : inputs.tools;
    }
    tmpl_inputs.add_generation_prompt = inputs.add_generation_prompt;
    tmpl_inputs.extra_context = inputs.extra_context;
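    // NOTE: this is the key line for the issue above: the variable injected
    // into the template context is named "enable_thinking", not "thinking".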
    tmpl_inputs.extra_context["enable_thinking"] = inputs.enable_thinking;
    if (additional_context) {
        tmpl_inputs.extra_context.merge_patch(*additional_context);
    }

    // TODO: add flag to control date/time, if only for testing purposes.
    // tmpl_inputs.now = std::chrono::system_clock::now();

    minja::chat_template_options tmpl_opts;
    // To avoid double BOS / EOS tokens, we're manually removing beginning / trailing tokens
    // instead of using `chat_template_options.use_bos_token = false`, since these tokens
    // may be needed inside the template / between messages too.
    auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
    if (inputs.add_bos && string_starts_with(result, tmpl.bos_token())) {
        result = result.substr(tmpl.bos_token().size());
    }
    if (inputs.add_eos && string_ends_with(result, tmpl.eos_token())) {
        result = result.substr(0, result.size() - tmpl.eos_token().size());
    }
    return result;
}
```
So I changed the variable `thinking` in the template to `enable_thinking`, and it seems to work. That would also explain the behavior above: the original template never read `enable_thinking`, so the two probe renders were identical, `common_chat_templates_support_enable_thinking` returned false, and the server disabled thinking entirely, hence `thinking = 0`.
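For reference, this is the original "Before" block with just that rename applied; with it, the probe above should pass and the server should log `thinking = 1` instead (given the other conditions in the quoted code: Jinja templating enabled and `reasoning_budget != 0`):
```jinja
{%- if add_generation_prompt -%}
    {%- if enable_thinking -%}
        {{- "<|im_start|>assistant\n<think>\n" -}}
    {%- else -%}
        {{- "<|im_start|>assistant\n" -}}
    {%- endif -%}
{%- endif -%}
```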