llama.cpp missing `<think>`
I'm lost here. I was having an issue where the `<think>` token wasn't prepended to the LM output.
Digging a bit, I saw there are no more references to `add_generation_prompt` in https://github.com/ggml-org/llama.cpp/tree/master/tools/server
I was confused, so I tried removing the condition:
Before:
```jinja
{%- if add_generation_prompt -%}
    {%- if thinking -%}
        {{- "<|im_start|>assistant\n<think>\n" -}}
    {%- else -%}
        {{- "<|im_start|>assistant\n" -}}
    {%- endif -%}
{%- endif -%}
```
After:
```jinja
{%- if thinking -%}
    {{- "<|im_start|>assistant\n<think>\n" -}}
{%- else -%}
    {{- "<|im_start|>assistant\n" -}}
{%- endif -%}
```
But then, even when I started my llama.cpp server with `--chat-template-kwargs '{"thinking": true}'`, I still didn't get the expected behavior.
So, in despair, I tried removing the `{%- if thinking -%}` condition too, ending up with only:
```jinja
{{- "<|im_start|>assistant\n<think>\n" -}}
```
And... it worked! I now always get the `<think>` tag prepended, and therefore always get the reasoning behavior.
But why? That's a mystery to me!
If anyone has an idea of what's going on there I'd be happy to know more.
By the way, thanks to @Xenova for their https://huggingface.co/spaces/Xenova/jinja-playground!
Same here too:
https://huggingface.co/aquif-ai/aquif-3.5-Max-42B-A3B/discussions/6#690da8dc15b4cbbdff9d882d
I made a template to work with LM Studio, but that too is not working properly.
@gopi87
Before we get help from someone else, you can use it with the changes above. I don't know whether LM Studio uses its own Jinja parser or the one from llama.cpp (which I read is minja rather than full Jinja, though I didn't check whether that's still the case, and I don't even know if the problem comes from there), but it at least works 100% of the time with llama.cpp!
Tool calls are a problem, though. I didn't try extensively yesterday, but from my early attempts it failed with the standard tool-calling format (like OpenCode or Crush, if I'm right) but worked flawlessly with the custom RooCode one, for example.
But at least I can make some use of the model now :)
Edit: not working well at all with RooCode; it tends to do too many steps in one tool call all the time. I didn't try it enough for regular use, but for the agentic case it doesn't seem so great :/
Edit 2: the last time I tried with RooCode, it did the exact opposite, editing files one line at a time, like 20 tool calls for fewer than 30 edited lines! I don't know what to think!
I noticed `srv init: thinking = 0` in the llama.cpp server output, and then I found this:
```cpp
// thinking is enabled if:
// 1. It's not explicitly disabled (reasoning_budget == 0)
// 2. The chat template supports it
const bool enable_thinking = params_base.use_jinja && params_base.reasoning_budget != 0 && common_chat_templates_support_enable_thinking(chat_templates.get());
SRV_INF("thinking = %d\n", enable_thinking);
```
```cpp
bool common_chat_templates_support_enable_thinking(const common_chat_templates * chat_templates) {
    common_chat_templates_inputs dummy_inputs;
    common_chat_msg msg;
    msg.role = "user";
    msg.content = "test";
    dummy_inputs.messages = {msg};
    dummy_inputs.enable_thinking = false;
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);
    return rendered_no_thinking.prompt != rendered_with_thinking.prompt;
}
```
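So the server probes the template: it renders it twice, once with `enable_thinking = false` and once with `enable_thinking = true`, and only reports support if the two rendered prompts differ. A minimal sketch of a fragment that would make the probe pass (the variable name is the important part, as the rendering code below shows):
```jinja
{#- reads enable_thinking, so the two probe renders differ -#}
{%- if enable_thinking -%}{{- "<think>\n" -}}{%- endif -%}
```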
...and further down, in `apply()`, where the template actually gets rendered:
```cpp
static std::string apply(
    const common_chat_template & tmpl,
    const struct templates_params & inputs,
    const std::optional<json> & messages_override = std::nullopt,
    const std::optional<json> & tools_override = std::nullopt,
    const std::optional<json> & additional_context = std::nullopt)
{
    minja::chat_template_inputs tmpl_inputs;
    tmpl_inputs.messages = messages_override ? *messages_override : inputs.messages;
    if (tools_override) {
        tmpl_inputs.tools = *tools_override;
    } else {
        tmpl_inputs.tools = inputs.tools.empty() ? json() : inputs.tools;
    }
    tmpl_inputs.add_generation_prompt = inputs.add_generation_prompt;
    tmpl_inputs.extra_context = inputs.extra_context;
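    // NOTE: this is the key line for the issue above: the variable injected
    // into the template context is named "enable_thinking", not "thinking".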
    tmpl_inputs.extra_context["enable_thinking"] = inputs.enable_thinking;
    if (additional_context) {
        tmpl_inputs.extra_context.merge_patch(*additional_context);
    }

    // TODO: add flag to control date/time, if only for testing purposes.
    // tmpl_inputs.now = std::chrono::system_clock::now();

    minja::chat_template_options tmpl_opts;
    // To avoid double BOS / EOS tokens, we're manually removing beginning / trailing tokens
    // instead of using `chat_template_options.use_bos_token = false`, since these tokens
    // may be needed inside the template / between messages too.
    auto result = tmpl.apply(tmpl_inputs, tmpl_opts);
    if (inputs.add_bos && string_starts_with(result, tmpl.bos_token())) {
        result = result.substr(tmpl.bos_token().size());
    }
    if (inputs.add_eos && string_ends_with(result, tmpl.eos_token())) {
        result = result.substr(0, result.size() - tmpl.eos_token().size());
    }
    return result;
}
```
So I changed the variable `thinking` in the template to `enable_thinking`, and it seems to work. That would also explain the behavior above: the original template never read `enable_thinking`, so the two probe renders were identical, `common_chat_templates_support_enable_thinking` returned false, and the server disabled thinking entirely, hence `thinking = 0`.
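For reference, this is the original "Before" block with just that rename applied; with it, the probe above should pass and the server should log `thinking = 1` instead (given the other conditions in the quoted code: Jinja templating enabled and `reasoning_budget != 0`):
```jinja
{%- if add_generation_prompt -%}
    {%- if enable_thinking -%}
        {{- "<|im_start|>assistant\n<think>\n" -}}
    {%- else -%}
        {{- "<|im_start|>assistant\n" -}}
    {%- endif -%}
{%- endif -%}
```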