Why were the core weights recently re-uploaded?

#7
by Geximus - opened

Hi cpatonn!

  1. I noticed that you recently re-uploaded the core safetensors weight shards and some config files (config.json, generation_config.json, recipe.yaml) for Qwen3-Next-80B-A3B-Thinking-AWQ-4bit — most files were from ~2 months ago, but the main weights were updated just 2 days ago.
    Could you please clarify what motivated the re-upload of the model weights?

I’m very interested in understanding:
Was it a bug fix, improvement in quantization quality, or regeneration of the AWQ weights?
Is this a newer or improved recipe/setup compared to the previous upload?
Did anything change in act-order, calibration dataset, or AWQ parameters?
Are the new weights expected to perform better than the earlier version?
(e.g., reasoning, stability, hallucinations, speed, memory footprint)

Should users who downloaded the older version re-download the updated one?

Any details would be greatly appreciated — just want to understand what exactly has been improved or changed.
Thanks again for all the work you’re doing!

  2. Also, I have one more question regarding model quality.

I downloaded the Thinking version of the model about two months ago, and when comparing it with the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit from your repo, I noticed that the older Thinking build tended to make more mistakes:

its chain-of-thought often didn’t match the final answer,
reasoning was sometimes inconsistent,
and for harder tasks the Instruct version actually performed better.

Could these issues be related to the recent re-upload of the model weight files?
In other words — does the new upload address any problems with the earlier Thinking quantization?
I'm wondering whether the latest weights include fixes or improvements that could reduce the reasoning inconsistencies I observed.

Thanks again for your work — just trying to understand the changes.

Hi @Geximus,

Thank you for your interest in my model. I’m more than happy to explain the changes in the recent upload.

The main changes in the recent updates are:

  1. Lower quantization loss, thanks to an improvement in the quantization algorithm.
  2. MTP layers are now supported!

Other than that, the AWQ quantization recipe and configs remain the same.

These changes bring:

  1. Quantized outputs that stay closer to the original BF16 model: perplexity and benchmark results show less than 1% degradation, i.e., a 0.3% increase in perplexity and a 0.7% decrease on the GPQA Diamond evaluation (see the sketch after this list for how the relative change is read).
  2. Faster inference when the MTP layer is used for speculative decoding.
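
To make those percentages concrete, here is a minimal sketch of the relative-change arithmetic; the perplexity values below are placeholders for illustration, not my actual measurements:

```python
# Minimal sketch of how a "0.3% increase in perplexity" is read as a
# relative change. The values below are placeholders, not measured results.
ppl_bf16 = 5.000   # hypothetical perplexity of the original BF16 model
ppl_awq4 = 5.015   # hypothetical perplexity of this AWQ 4-bit quant

rel_increase = (ppl_awq4 - ppl_bf16) / ppl_bf16 * 100
print(f"relative perplexity increase: {rel_increase:.1f}%")  # -> 0.3%
```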

I would recommend re-downloading if you plan to use speculative decoding, or if a 1-2% accuracy improvement matters for your use case.
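
If you want to try the MTP-based speculative decoding path, here is a rough, unofficial sketch using vLLM's offline API. The `speculative_config` keys and the `qwen3_next_mtp` method name follow my reading of the vLLM documentation for Qwen3-Next; please verify them against the vLLM version you actually run.

```python
# Rough sketch: loading the AWQ 4-bit checkpoint in vLLM with the bundled MTP
# layers used as the draft for speculative decoding. The config keys and values
# here are assumptions to check against your vLLM version, not a tested recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",
    tensor_parallel_size=4,            # adjust to your GPU setup
    speculative_config={
        "method": "qwen3_next_mtp",    # draft tokens come from the MTP layers
        "num_speculative_tokens": 2,
    },
)

out = llm.generate(
    ["Briefly explain what speculative decoding does."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(out[0].outputs[0].text)
```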

Regarding the previous release of the quantized Thinking model: if the inconsistent reasoning and mismatched CoT were caused by quantization loss, those issues are much less likely to occur in the recently updated model. In the previous version, did the problems appear or worsen only in long contexts, or did they show up even within the first 1-2 prompts? Knowing this would help me improve future models.

Thank you again for trying my model!

I use the Instruct model in combination with RAG for legal reasoning tasks, and on medium to moderately complex questions Instruct consistently delivers highly reliable and precise results: its answers are clear, logically structured, and fully capture all key legal aspects from the context. In contrast, the “Thinking” version, despite being explicitly designed for deep reasoning, often produces excessive, irrelevant tangents (“fluff”), loses logical coherence midway through its chain of thought, and frequently ends up with incomplete or insufficiently grounded conclusions.

I tested both models on five legal questions using RAG context, and in every case the “Thinking” model scored lower, precisely because it missed critical legal provisions, misinterpreted contextual references, or drifted into faulty analogies. While the “Thinking” model’s responses read as polite and grammatically sound, in practice they are less dependable than Instruct’s. This is particularly surprising, since one would expect a model explicitly optimized for reasoning to outperform the instruct-tuned version on analytical tasks, not underperform it.
At the same time, I use the exact same prompt for both models. It’s possible that the “Thinking” model requires a more tailored prompt — for example, one that explicitly instructs it to rigorously verify its own reasoning, or to ensure its final answer fully captures all nuances relevant to solving the question, or perhaps other more specific directives. The first version of the “Thinking” model performed significantly worse than it does after your update two days ago, but even the new version still underperforms compared to Instruct. However, this gap might be narrowed—or even closed—with a more carefully designed prompt. Do you have any recommendations for crafting prompts specifically optimized for the “Thinking” model, taking into account its unique architecture and behavioral tendencies?
