Qwen3-30B-A3B-Thinking-2507-NVFP4

This is a 4-bit NVFP4 quantized version of Qwen/Qwen3-30B-A3B-Thinking-2507, compressed using llmcompressor.

Model Description

This model represents a significant compression of the original 30B parameter Qwen3 thinking model, reducing the model size by approximately 75% while maintaining most of its reasoning capabilities. The quantization was performed using NVIDIA's FP4 (4-bit floating point) format, which is optimized for deployment on NVIDIA GPUs with Blackwell architecture.

Quantization Details

  • Method: NVFP4 (NVIDIA 4-bit Floating Point)
  • Tool: llmcompressor v0.3.0+
  • Original Size: ~61 GB in BF16 (~122 GB in FP32)
  • Compressed Size: ~18 GB
  • Compression Ratio: ~3.5x relative to BF16 (~7x relative to FP32)
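
As a back-of-envelope check on these figures (the 30.5B parameter count and the ~4.5 effective bits per NVFP4 weight, i.e. 4-bit values plus per-block scales, are assumptions rather than measured values):

# Rough memory estimate; the actual checkpoint size also includes the layers
# kept at full precision (embeddings, norms, lm_head, MoE gates).
params = 30.5e9                      # approximate parameter count

bf16_gb  = params * 16  / 8 / 1e9    # ~61 GB
fp4_gb   = params * 4   / 8 / 1e9    # ~15 GB (weights alone, the "75%" figure)
nvfp4_gb = params * 4.5 / 8 / 1e9    # ~17 GB including per-block scales

print(f"BF16:  ~{bf16_gb:.0f} GB")
print(f"NVFP4: ~{nvfp4_gb:.0f} GB")
print(f"Ratio: ~{bf16_gb / nvfp4_gb:.1f}x")   # ~3.5x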

Quantization Configuration

targets: Linear
scheme: NVFP4
ignore:
  - lm_head
  - model.embed_tokens
  - re:.*input_layernorm$
  - re:.*post_attention_layernorm$
  - model.norm
  - re:.*mlp.gate$

Key layers preserved at full precision:

  • Output head (lm_head)
  • Embeddings
  • Layer normalization layers
  • MLP gate layers
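
For reference, a recipe equivalent to the configuration above can be applied with llmcompressor's one-shot API. The sketch below assumes a recent llmcompressor release with NVFP4 support and a calibration_dataset prepared as in the next section; dataset preprocessing details are omitted:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen3-30B-A3B-Thinking-2507"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize every Linear layer to NVFP4, leaving the sensitive layers listed
# above (lm_head, embeddings, norms, MoE router gates) at full precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "model.embed_tokens",
        "re:.*input_layernorm$",
        "re:.*post_attention_layernorm$",
        "model.norm",
        "re:.*mlp.gate$",
    ],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset=calibration_dataset,   # built in the calibration section below
    max_seq_length=20000,
    num_calibration_samples=1250,
)

model.save_pretrained("Qwen3-30B-A3B-Thinking-2507-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-30B-A3B-Thinking-2507-NVFP4")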

Calibration Dataset

The model was calibrated using 1,250 samples from the NVIDIA Llama-Nemotron Post-Training Dataset:

  • 250 samples from math split
  • 250 samples from code split
  • 250 samples from science split
  • 250 samples from chat split
  • 250 samples from safety split

All samples were filtered for:

  • Reasoning mode enabled (reasoning: on)
  • Maximum sequence length of 20,000 tokens
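
A sketch of how such a calibration set can be assembled with the datasets library is below. The "SFT" config name and the column names (in particular "reasoning") are assumptions about the NVIDIA dataset's layout; the 20,000-token cap is enforced later via max_seq_length during calibration:

from datasets import load_dataset, concatenate_datasets

# Draw 250 reasoning-on samples from each of the five subsets.
splits = ["math", "code", "science", "chat", "safety"]
subsets = []

for split in splits:
    ds = load_dataset(
        "nvidia/Llama-Nemotron-Post-Training-Dataset", "SFT", split=split
    )
    ds = ds.filter(lambda row: row["reasoning"] == "on")
    subsets.append(ds.shuffle(seed=42).select(range(250)))

calibration_dataset = concatenate_datasets(subsets)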

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use the model
messages = [
    {"role": "user", "content": "Solve this step by step: What is 25 * 48?"}
]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Thinking models can produce long reasoning traces; raise max_new_tokens if
# the answer gets cut off.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
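
Like the original model, this checkpoint emits a chain of thought before the final answer. Assuming the reasoning is closed with a </think> tag, as in the Qwen3 thinking models, the two parts can be separated like this:

# Split the reasoning trace from the final answer; if no closing tag is found,
# the whole response stays in `thinking` and `answer` is empty.
thinking, _, answer = response.partition("</think>")
print("Reasoning:", thinking.strip())
print("Answer:", answer.strip())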

With vLLM (Recommended for Production)

NVFP4 quantized models are optimized for deployment with vLLM:

from vllm import LLM, SamplingParams

model_id = "your-username/Qwen3-30B-A3B-Thinking-2507-NVFP4"

llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = ["Solve step by step: What is 25 * 48?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
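
The same checkpoint can also be served through vLLM's OpenAI-compatible server (for example, vllm serve MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4) and queried with the standard OpenAI client. A minimal sketch, assuming the server is listening on the default port 8000:

from openai import OpenAI

# Point the client at the local vLLM server; the API key is required by the
# client but not checked by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4",
    messages=[{"role": "user", "content": "Solve step by step: What is 25 * 48?"}],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)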

Performance Characteristics

Advantages

  • Memory Efficiency: ~75% reduction in memory requirements
  • Faster Inference: Reduced memory bandwidth requirements lead to faster token generation
  • Deployment Flexibility: Can run on GPUs with smaller VRAM
  • Preserved Quality: Critical layers maintained at full precision

Trade-offs

  • Slight accuracy degradation compared to full precision model
  • Best performance on NVIDIA GPUs with FP4 support
  • May require specific deployment frameworks for optimal performance

Limitations

  • This is a quantized model with some accuracy trade-offs
  • Performance is optimized for NVIDIA GPUs
  • Not all inference frameworks support NVFP4 format natively
  • The model retains the same context length limitations as the original

Citation

If you use this model, please cite both the original model and the quantization method:

@misc{qwen3-thinking-2507,
  title={Qwen3-30B-A3B-Thinking-2507},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={vLLM Team},
  url={https://github.com/vllm-project/llm-compressor},
  year={2024}
}

License

This model follows the same license as the original Qwen3-30B-A3B-Thinking-2507 model.

Acknowledgments

  • Original model by Qwen Team
  • Quantization performed using llmcompressor by vLLM Team
  • Calibration dataset provided by NVIDIA