# Qwen3-30B-A3B-Thinking-2507-NVFP4
This is a 4-bit NVFP4 quantized version of Qwen/Qwen3-30B-A3B-Thinking-2507, compressed using llmcompressor.
## Model Description
This model is a compressed version of the original 30B-parameter Qwen3 thinking model, reducing model size by approximately 75% while preserving most of its reasoning capability. Quantization uses NVIDIA's NVFP4 (4-bit floating point) format, which is optimized for deployment on NVIDIA GPUs with native FP4 support (Blackwell architecture).
## Quantization Details
- Method: NVFP4 (NVIDIA 4-bit Floating Point)
- Tool: llmcompressor v0.3.0+
- Original Size: ~60-120GB (depending on precision)
- Compressed Size: ~18GB
- Compression Ratio: ~4-8x
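As a rough sanity check on these numbers, the arithmetic below estimates the packed weight size, assuming ~30.5B total parameters and one shared FP8 scale per 16-element NVFP4 block (both are assumptions for illustration, not measurements of this checkpoint):

```python
# Back-of-envelope size estimate (assumed parameter count and NVFP4 layout;
# not measured from this checkpoint).
params = 30.5e9                      # total parameters (MoE, ~3B active)
bf16_gb = params * 2 / 1e9           # 16-bit weights -> ~61 GB
bits_per_weight = 4 + 8 / 16         # 4-bit value + shared FP8 scale per 16-block
nvfp4_gb = params * bits_per_weight / 8 / 1e9
print(f"BF16:  ~{bf16_gb:.0f} GB")   # ~61 GB
print(f"NVFP4: ~{nvfp4_gb:.0f} GB")  # ~17 GB before full-precision layers
```

The layers kept at full precision (listed in the next section) account for the remaining gap up to the ~18GB on-disk size.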
### Quantization Configuration
```yaml
targets: Linear
scheme: NVFP4
ignore:
  - lm_head
  - model.embed_tokens
  - "re:.*input_layernorm$"
  - "re:.*post_attention_layernorm$"
  - model.norm
  - "re:.*mlp.gate$"
```
Key layers preserved at full precision:
- Output head (lm_head)
- Embeddings
- Layer normalization layers
- MoE router gate layers (mlp.gate)
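For reference, applying this recipe with llmcompressor might look roughly like the sketch below. The exact `oneshot` arguments and the `open_platypus` placeholder dataset are illustrative assumptions (the actual calibration data is described in the next section), and API details vary across llmcompressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B-Thinking-2507"
SAVE_DIR = "Qwen3-30B-A3B-Thinking-2507-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Mirror the configuration above: NVFP4 on Linear layers, skipping the
# output head, embeddings, norms, and MoE router gates.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "model.embed_tokens",
        "re:.*input_layernorm$",
        "re:.*post_attention_layernorm$",
        "model.norm",
        "re:.*mlp.gate$",
    ],
)

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",        # placeholder; the actual run used the
                                    # Nemotron mixture described below
    max_seq_length=20000,
    num_calibration_samples=1250,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```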
## Calibration Dataset
The model was calibrated using 1,250 samples from the NVIDIA Llama-Nemotron Post-Training Dataset:
- 250 samples from math split
- 250 samples from code split
- 250 samples from science split
- 250 samples from chat split
- 250 samples from safety split
All samples were filtered for:
- Reasoning mode enabled (`reasoning: on`)
- Maximum sequence length of 20,000 tokens
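A rough sketch of how such a mixture could be assembled with the `datasets` library is shown below. The dataset repo id, split names, and `reasoning` field are assumptions inferred from the description above rather than a verified schema, and token-length filtering with the model tokenizer is omitted for brevity.

```python
# Illustrative only: repo id, split names, and field names are assumptions.
from datasets import load_dataset

SPLITS = ["math", "code", "science", "chat", "safety"]
SAMPLES_PER_SPLIT = 250

calibration_samples = []
for split in SPLITS:
    ds = load_dataset(
        "nvidia/Llama-Nemotron-Post-Training-Dataset",
        split=split,
        streaming=True,
    )
    # Keep reasoning-mode samples, then take the first 250 from each split.
    ds = ds.filter(lambda ex: ex.get("reasoning") == "on")
    calibration_samples.extend(ds.take(SAMPLES_PER_SPLIT))

print(f"Collected {len(calibration_samples)} calibration samples")  # 1,250
```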
## Usage
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a chat prompt and generate
messages = [
    {"role": "user", "content": "Solve this step by step: What is 25 * 48?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### With vLLM (Recommended for Production)
NVFP4 quantized models are optimized for deployment with vLLM:
```python
from vllm import LLM, SamplingParams

model_id = "MrVolts/Qwen3-30B-A3B-Thinking-2507-NVFP4"

llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = ["Solve step by step: What is 25 * 48?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
## Performance Characteristics
### Advantages
- Memory Efficiency: ~75% reduction in memory requirements
- Faster Inference: Reduced memory bandwidth requirements lead to faster token generation
- Deployment Flexibility: Can run on GPUs with smaller VRAM
- Preserved Quality: Critical layers maintained at full precision
### Trade-offs
- Slight accuracy degradation compared to full precision model
- Best performance on NVIDIA GPUs with FP4 support
- May require specific deployment frameworks for optimal performance
## Limitations
- This is a quantized model with some accuracy trade-offs
- Performance is optimized for NVIDIA GPUs
- Not all inference frameworks support NVFP4 format natively
- The model retains the same context length limitations as the original
## Citation
If you use this model, please cite both the original model and the quantization method:
```bibtex
@misc{qwen3-thinking-2507,
  title={Qwen3-30B-A3B-Thinking-2507},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={vLLM Team},
  url={https://github.com/vllm-project/llm-compressor},
  year={2024}
}
```
## License
This model follows the same license as the original Qwen3-30B-A3B-Thinking-2507 model.
## Acknowledgments
- Original model by Qwen Team
- Quantization performed using llmcompressor by vLLM Team
- Calibration dataset provided by NVIDIA