Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
This repository contains an OBR-quantized Llama-2-7B model, based on the research presented in the paper "Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs".
Paper
Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Code / GitHub Repository
The official implementation and training code can be found at: https://github.com/csguoh/OBR
Abstract
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization through error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
Highlights
- The First to Enable W4A4KV4+50% Sparsity LLMs: OBR pushes the boundaries of LLM compression by enabling aggressive W4A4KV4 quantization with 50% unstructured sparsity.
- Strong Performance on WikiText Perplexity and Zero-shot Evaluation: Maintains low WikiText perplexity and strong accuracy across zero-shot evaluation benchmarks despite the aggressive compression (see the evaluation sketch after this list).
- Promising Efficiency against Dense INT4 Baselines: Delivers significant efficiency gains, including up to 4.72x speedup and 6.4x memory reduction compared to FP16-dense baselines.
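As a rough illustration of how the WikiText perplexity numbers referenced above are typically measured, the following is a minimal sketch using the Hugging Face transformers and datasets libraries. The repository ID is a placeholder, and the windowing scheme (2048-token non-overlapping chunks) is an assumption rather than the paper's exact protocol; refer to the official GitHub repository for the evaluation code used in the paper.

```python
# Minimal WikiText-2 perplexity sketch; the repository ID is a placeholder
# and the evaluation windowing is an assumption, not the paper's exact setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "username/obr-llama-2-7b-w4a4kv4"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText-2 test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
seq_len = 2048
nlls = []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start : start + seq_len].to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # labels are shifted internally
    nlls.append(out.loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```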
Usage
This model is a Llama-2-7B variant that has been processed with the OBR framework. It is designed to be compatible with the Hugging Face transformers library.
For detailed installation, usage instructions, and examples for applying the OBR framework to other base models or reproducing the results, please refer to the official GitHub repository.
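As a minimal sketch, loading and running the model might look like the snippet below. It assumes the checkpoint loads through the standard transformers APIs; the repository ID and prompt are placeholders, so consult the official GitHub repository for the exact, supported workflow.

```python
# Minimal loading/generation sketch; "username/obr-llama-2-7b-w4a4kv4" is a
# placeholder repository ID, assumed to be loadable with standard transformers APIs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "username/obr-llama-2-7b-w4a4kv4"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Large language model compression aims to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```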
Citation
If you find our work useful for your research, please cite the paper:
@misc{guo2025optimalbrainrestoration,
title={Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs},
author={Hang Guo and Yawei Li and Luca Benini},
year={2025},
eprint={2509.11177},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.11177},
}
License
Since this work builds on previous works including QuaRot, SpinQuant, and FlatQuant, users should follow the license of the corresponding backbone model. This specific model is based on the QuaRot framework and is therefore released under the Apache 2.0 License.