Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
This repository contains an OBR-quantized Llama-2-7B model, based on the research presented in the paper "Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs".
Paper
Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Code / GitHub Repository
The official implementation and training code can be found at: https://github.com/csguoh/OBR
Abstract
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization through error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
Highlights
- The First to Enable W4A4KV4+50% Sparsity LLMs: OBR pushes the boundaries of LLM compression by enabling aggressive W4A4KV4 quantization with 50% unstructured sparsity.
- Strong Performance on WikiText Perplexity and Zero-shot Evaluation: Maintains low WikiText perplexity and strong accuracy across zero-shot evaluation benchmarks despite the aggressive compression (see the evaluation sketch after this list).
- Promising Efficiency against Dense INT4 Baselines: Delivers significant efficiency gains, including up to 4.72x speedup and 6.4x memory reduction compared to FP16-dense baselines.
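As a rough illustration of how the WikiText perplexity numbers referenced above are typically measured, the following is a minimal sketch using the Hugging Face transformers and datasets libraries. The repository ID is a placeholder, and the windowing scheme (2048-token non-overlapping chunks) is an assumption rather than the paper's exact protocol; refer to the official GitHub repository for the evaluation code used in the paper.

```python
# Minimal WikiText-2 perplexity sketch; the repository ID is a placeholder
# and the evaluation windowing is an assumption, not the paper's exact setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "username/obr-llama-2-7b-w4a4kv4"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText-2 test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
seq_len = 2048
nlls = []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start : start + seq_len].to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # labels are shifted internally
    nlls.append(out.loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```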
Usage
This model is a Llama-2-7B variant that has been processed with the OBR framework. It is designed to be compatible with the Hugging Face transformers library.
For detailed installation, usage instructions, and examples for applying the OBR framework to other base models or reproducing the results, please refer to the official GitHub repository.
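As a minimal sketch, loading and running the model might look like the snippet below. It assumes the checkpoint loads through the standard transformers APIs; the repository ID and prompt are placeholders, so consult the official GitHub repository for the exact, supported workflow.

```python
# Minimal loading/generation sketch; "username/obr-llama-2-7b-w4a4kv4" is a
# placeholder repository ID, assumed to be loadable with standard transformers APIs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "username/obr-llama-2-7b-w4a4kv4"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Large language model compression aims to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```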
Citation
If you find our work useful for your research, please cite the paper:
@misc{guo2025optimalbrainrestoration,
title={Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs},
author={Hang Guo and Yawei Li and Luca Benini},
year={2025},
eprint={2509.11177},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.11177},
}
License
Since this work builds on previous works including QuaRot, SpinQuant, and FlatQuant, users should follow the license of the corresponding backbone model. This specific model is based on the QuaRot framework and is therefore released under the Apache 2.0 License.