Bits for embedding / lm-head / non-expert layers
The embedding and lm-head layers fall back to 8 bits, and the non-expert layers fall back to 4 bits.
Hi, can you explain the reasoning behind these choices? I'm asking because I see others typically using slightly higher bits for the non-expert layers than for the embedding / lm-head layers.
For example, the IQ2_K quant from https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF uses:
```
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_k
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```
We have not explored mixed-bit configurations; our choices are primarily based on prior experience. You may be able to use our library to produce an even better model, or find more effective settings in other repositories. Note that our open-sourced algorithm differs from the official implementation, and we are working on an even better version, so stay tuned!
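If you want to experiment with your own mixed-bit recipe, here is a minimal Python sketch of how regex-to-bits overrides in the style of the quoted recipe could be resolved (first matching rule wins). The patterns and bit widths are illustrative placeholders, not this model's settings or any particular quantization library's API:

```python
import re

# Illustrative mixed-bit recipe: tensor-name regex -> target bit width.
# The patterns follow GGUF-style names as in the quoted IQ2_K recipe;
# the widths are placeholders, not the settings used for this model.
RECIPE = [
    (r"token_embd\.weight", 8),                         # embedding
    (r"output\.weight", 8),                             # lm-head
    (r"blk\..*\.attn_.*", 4),                           # attention (non-expert)
    (r"blk\..*\.ffn_.*_shexp\.weight", 4),              # shared experts
    (r"blk\..*\.ffn_(gate|up|down)_exps\.weight", 2),   # routed experts
]
DEFAULT_BITS = 4  # fallback for anything not matched above

def bits_for(tensor_name: str) -> int:
    """Return the bit width for a tensor: first matching rule wins."""
    for pattern, bits in RECIPE:
        if re.fullmatch(pattern, tensor_name):
            return bits
    return DEFAULT_BITS

print(bits_for("blk.12.ffn_gate_exps.weight"))  # -> 2
print(bits_for("token_embd.weight"))            # -> 8
```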
For the embedding layer and LM head: given that this model is extremely large and the first/last layers are particularly important, we chose to keep them at higher bit precision. Although we have observed that the embedding layer can be quantized to very low bits, we prefer higher bits to preserve accuracy.
For the expert modules and related components: since the routed experts account for the majority of the model parameters, they must be quantized to lower bits to save memory. In contrast, the shared experts and the non-expert layers contain relatively few parameters, so we keep them at higher precision. A similar strategy can be found in some open-source models, such as OpenAI's gpt-oss, which quantizes only the MoE parameters.
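To make the memory argument concrete, here is a back-of-the-envelope sketch. The per-group parameter counts are rough placeholders (orders of magnitude for a ~480B MoE model), not the actual Qwen3-Coder-480B breakdown; only the bookkeeping logic is the point:

```python
import re

def classify(tensor_name: str) -> str:
    """Bucket a (GGUF-style) tensor name into a quantization group."""
    if re.search(r"ffn_(gate|up|down)_exps", tensor_name):
        return "routed_experts"
    if re.search(r"shexp", tensor_name):
        return "shared_experts"
    if tensor_name in ("token_embd.weight", "output.weight"):
        return "embed_lm_head"
    return "other"  # attention and remaining non-expert layers

print(classify("blk.3.ffn_up_exps.weight"))  # -> routed_experts
print(classify("blk.3.attn_q.weight"))       # -> other

# Placeholder parameter counts per group and the bit widths discussed above.
param_counts = {"routed_experts": 450e9, "shared_experts": 4e9,
                "embed_lm_head": 2e9, "other": 24e9}
bits_per_group = {"routed_experts": 2, "shared_experts": 4,
                  "embed_lm_head": 8, "other": 4}

total_gib = sum(param_counts[g] * bits_per_group[g] for g in param_counts) / 8 / 2**30
print(f"Estimated weight size: ~{total_gib:.0f} GiB")
# Even at 2 bits, the routed experts contribute ~105 GiB of the ~120 GiB total,
# so raising the small groups (embedding, lm-head, shared experts) to 4 or 8
# bits barely moves the overall memory budget.
```

This is why the precision budget goes to the small, sensitive tensors while the experts are pushed as low as accuracy allows.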