# OLMo-2-0325-32B-Instruct AWQ 4-bit

This is a 4-bit AWQ-quantized version of [allenai/OLMo-2-0325-32B-Instruct](https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct), produced with [LLM Compressor](https://github.com/vllm-project/llm-compressor).
## Key Features

- ✅ 32B parameters quantized to 4-bit - ~74% size reduction
- ✅ Fully open model - code, data, and training details all public
- ✅ Post-trained on Tülu 3 - SFT → DPO → RLVR pipeline
- ✅ Strong performance - competitive with Llama 3.1 70B on many tasks
- ✅ State-of-the-art on specific tasks - MATH, GSM8K, IFEval
## Model Details

- Base Model: allenai/OLMo-2-0325-32B-Instruct (32B parameters)
- Architecture: OLMo 2 (fully open language model)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: OpenOrca (128 samples)
## Size Comparison

| Metric | Value |
|---|---|
| Original (BF16) | ~64.0 GB |
| Quantized (W4A16) | ~16.9 GB |
| Reduction | ~73.6% |
| Memory Saved | ~47.1 GB |
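These numbers follow from simple parameter arithmetic: 32B weights at 2 bytes each give ~64 GB in BF16, while 4-bit weights alone give ~16 GB; quantization scales/zero-points and layers left in higher precision account for the remaining ~0.9 GB. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope size check for a 32B-parameter model.
# Exact figures depend on which layers stay unquantized and on
# per-group scale/zero-point overhead.
params = 32e9

bf16_gb = params * 2 / 1e9    # 2 bytes per weight -> ~64 GB
w4_gb = params * 0.5 / 1e9    # 4 bits per weight  -> ~16 GB

print(f"BF16: ~{bf16_gb:.1f} GB")           # ~64.0 GB
print(f"W4 weights only: ~{w4_gb:.1f} GB")  # ~16.0 GB (overhead brings it to ~16.9 GB)
print(f"Reduction: ~{100 * (1 - 16.9 / 64):.1f}%")  # ~73.6%, matching the table
```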
## About OLMo 2

OLMo 2 is a series of fully open language models from the Allen Institute for AI:

- Training: Pretrained on the Dolma dataset
- Post-training: Supervised finetuning, DPO, and RLVR on Tülu 3
- Performance: Competitive with much larger models
- Openness: All code, data, and training details released
## Performance Highlights

- Average Score: 68.8 across diverse benchmarks
- GSM8K: 87.6 (math reasoning)
- IFEval: 85.6 (instruction following)
- MATH: 49.7 (mathematical problem solving)
- MMLU: 77.3 (general knowledge)
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "ronantakizawa/olmo2-32b-instruct-awq-w4a16",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "ronantakizawa/olmo2-32b-instruct-awq-w4a16",
    trust_remote_code=True,
)

# Format the conversation with the chat template
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
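LLM Compressor saves weights in the compressed-tensors format, so the checkpoint should also be servable with vLLM, which supports W4A16 compressed-tensors models natively. A minimal sketch, untested against this specific repo:

```python
from vllm import LLM, SamplingParams

# vLLM picks up the compressed-tensors W4A16 config from the checkpoint;
# OLMo 2 support requires a recent vLLM release.
llm = LLM(model="ronantakizawa/olmo2-32b-instruct-awq-w4a16")
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.chat(
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    params,
)
print(outputs[0].outputs[0].text)
```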
## Chat Template

The model uses this chat template format:

```
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
```
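You can sanity-check that the tokenizer reproduces this format (the exact whitespace comes from the tokenizer's bundled template, so treat the expected string as illustrative):

```python
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How are you doing?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(text)  # expected: "<|user|>\nHow are you doing?\n<|assistant|>\n"
```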
## System Prompt (Optional)

In Ai2 demos, this system prompt is used by default:

```
You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI.
```

However, the model was not trained to require a specific system prompt.
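To reproduce the demo behavior, prepend it as a system message before applying the chat template:

```python
messages = [
    {
        "role": "system",
        "content": "You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI.",
    },
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```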
## Quantization Details

- Method: AWQ (Activation-aware Weight Quantization)
- Pipeline: LLM Compressor's independent pipeline (BasicPipeline), quantizing layer by layer - see the sketch below
- Calibration: 128 OpenOrca samples
- Max Sequence Length: 512 tokens
- Why AWQ: preserves the most important weights based on activation patterns
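A run along these lines can be reproduced with LLM Compressor's `oneshot` API. This is a minimal sketch, not the exact script behind this checkpoint; in particular, the flattening of OpenOrca into a `text` column and the modifier arguments are assumptions based on LLM Compressor's published AWQ examples:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "allenai/OLMo-2-0325-32B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 128 OpenOrca calibration samples, flattened into a single text field.
ds = load_dataset("Open-Orca/OpenOrca", split="train[:128]")
ds = ds.map(lambda ex: {"text": ex["question"] + "\n" + ex["response"]})

# W4A16 AWQ on all Linear layers, keeping the output head in full precision.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
)

model.save_pretrained("olmo2-32b-instruct-awq-w4a16", save_compressed=True)
tokenizer.save_pretrained("olmo2-32b-instruct-awq-w4a16")
```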
## Requirements

- Transformers (install from the main branch for OLMo 2 support: `pip install git+https://github.com/huggingface/transformers.git`)
- PyTorch with AWQ/GPTQ support
- 20GB+ GPU VRAM for inference
## Limitations

- Quantization may cause slight quality degradation compared to BF16
- Limited safety training (not production-ready without additional filtering)
- Primarily English-language support
## License

Apache 2.0 (same as the base model)
## Citation

```bibtex
@article{olmo20242olmo2furious,
  title={2 OLMo 2 Furious},
  author={Team OLMo and Pete Walsh and Luca Soldaini and others},
  year={2024},
  eprint={2501.00656},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.00656},
}
```

```bibtex
@misc{olmo2-32b-awq,
  title={OLMo-2-0325-32B-Instruct AWQ 4-bit},
  author={ronantakizawa},
  note={Quantized with LLM Compressor},
  year={2025},
  url={https://huggingface.co/ronantakizawa/olmo2-32b-instruct-awq-w4a16}
}
```
## Acknowledgements

- Base model by the Allen Institute for AI
- Quantized with LLM Compressor