Zip2Zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
Overview
Zip2Zip brings inference-time adaptive tokenization to large language models (LLMs) using online LZW token compression.
Instead of relying on a fixed vocabulary, Zip2Zip dynamically merges frequently co-occurring token sequences into hypertokens during inference.
This allows the model to adapt its vocabulary to each context, reducing the number of tokens needed and thus speeding up generation.
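To give intuition, here is a minimal LZW-style sketch over token IDs. It is illustrative only: the function name, greedy matching details, and ID assignment are simplifications, not the zip2zip implementation.

# Illustrative sketch of LZW-style token compression (not the zip2zip library code).
# Token IDs below a base vocabulary size V are ordinary tokens; new "hypertokens"
# are assigned IDs V, V+1, ... as co-occurring token sequences are merged.

def lzw_compress(token_ids, base_vocab_size, max_merge_size=3):
    codebook = {}                 # maps tuples of base-token IDs -> hypertoken ID
    next_id = base_vocab_size     # first unused hypertoken ID
    output = []
    current = []                  # longest match found so far (list of base-token IDs)

    for tok in token_ids:
        candidate = current + [tok]
        if len(candidate) == 1 or tuple(candidate) in codebook:
            current = candidate   # keep extending the match
        else:
            # emit the code for the longest known match
            output.append(codebook[tuple(current)] if len(current) > 1 else current[0])
            # register the new sequence as a hypertoken (bounded merge size)
            if len(candidate) <= max_merge_size:
                codebook[tuple(candidate)] = next_id
                next_id += 1
            current = [tok]

    if current:
        output.append(codebook[tuple(current)] if len(current) > 1 else current[0])
    return output, codebook

compressed, codebook = lzw_compress([5, 7, 5, 7, 5, 7, 9], base_vocab_size=100)
print(compressed)  # repeated (5, 7) pairs collapse into hypertoken IDs >= 100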
- Paper: zip2zip: Inference-Time Adaptive Tokenization via Online Compression (NeurIPS 2025)
- Repository: https://github.com/epfl-dlab/zip2zip
- Base model: microsoft/Phi-3.5-mini-instruct
Uptraining
This model was uptrained from microsoft/Phi-3.5-mini-instruct using parameter-efficient finetuning (LoRA) for ~10 GPU-hours.
During uptraining, the model learned to operate on LZW-compressed sequences while preserving semantic reconstruction via an auxiliary autoencoding loss.
This allows it to generate and interpret hypertokens seamlessly during inference.
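For orientation, the LoRA side of such uptraining could be configured with the peft library roughly as sketched below. The rank, alpha, and target modules are assumed values, not the authors' configuration, and the zip2zip-specific parts (compressed inputs, hyper-encoder, auxiliary loss) are not shown.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                   # rank (assumed value)
    lora_alpha=32,                          # scaling (assumed value)
    target_modules=["qkv_proj", "o_proj"],  # Phi-3 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()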
Features
- Dynamic vocabulary expansion during inference
- LZW-based online token compression for input + output
- Hyper-encoder module computes embeddings for new hypertokens on the fly (see the sketch after this list)
- Plug-and-play integration with Hugging Face Transformers
- Compatible with PEFT / LoRA / 8-bit quantization
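The hyper-encoder bullet above can be pictured with a minimal stand-in that mean-pools the embeddings of a hypertoken's constituent tokens. The actual hyper-encoder in zip2zip is a learned module; this sketch only conveys the shape of the idea, and the function name is hypothetical.

import torch

def hypertoken_embedding(base_embeddings, constituent_ids):
    # base_embeddings: (vocab_size, hidden_dim) embedding matrix of the base model
    # constituent_ids: base-token IDs that were merged into the hypertoken
    return base_embeddings[constituent_ids].mean(dim=0)

emb_matrix = torch.randn(32064, 3072)  # Phi-3.5-mini vocabulary and hidden sizes
hyper_emb = hypertoken_embedding(emb_matrix, [5, 7])
print(hyper_emb.shape)  # torch.Size([3072])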
Installation
pip install zip2zip
Usage
Same API as Hugging Face Transformers
import torch
from zip2zip import Zip2ZipModel, Zip2ZipTokenizer

pretrained_model_url = "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the zip2zip tokenizer and model (drop-in replacements for the Transformers classes)
tokenizer = Zip2ZipTokenizer.from_pretrained(pretrained_model_url)
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map=device)

inputs = tokenizer("Write a MultiHeadAttention layer in PyTorch", return_tensors="pt").to(device)
outputs = model.generate(**inputs)

# color_decode returns the decoded text with hypertokens highlighted
print(tokenizer.color_decode(outputs))
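Because the model exposes the standard Transformers generation API, the usual generation arguments should pass through unchanged. The call below is an assumption based on that "same API" claim rather than something taken from the zip2zip docs.

# Standard generation kwargs (e.g. max_new_tokens, do_sample) are assumed to work as in Transformers
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.color_decode(outputs))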
Quantized inference
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map="auto", load_in_8bit=True)
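On recent transformers versions, the load_in_8bit flag is deprecated in favor of BitsAndBytesConfig. Assuming Zip2ZipModel.from_pretrained forwards quantization_config like a standard Transformers model, the equivalent call would be:

from transformers import BitsAndBytesConfig

# 8-bit loading via BitsAndBytesConfig (requires the bitsandbytes package)
model = Zip2ZipModel.from_pretrained(
    pretrained_model_url,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)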
Training Details
- Base: Phi-3.5-mini-instruct
- Method: Continued pretraining with LoRA (parameter-efficient finetuning)
- Data: 100 M tokens (The Pile, C4, Paloma mC4/dC4)
- Objectives (combined as sketched after this list):
- Causal LM loss on LZW-compressed sequences
- Auxiliary reconstruction loss (λ = 0.1)
- Max merge size: 3
- Precision: bf16 mixed
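A minimal sketch of how the two objectives combine, with the form inferred from the bullets above and toy tensor shapes; this is not the authors' training code, and combined_loss is a hypothetical helper.

import torch
import torch.nn.functional as F

def combined_loss(lm_logits, compressed_targets, recon_logits, original_targets, lam=0.1):
    # lm_logits:    (batch, seq, vocab) next-token predictions over the compressed sequence
    # recon_logits: (batch, seq, vocab) predictions reconstructing the uncompressed tokens
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), compressed_targets)
    recon_loss = F.cross_entropy(recon_logits.transpose(1, 2), original_targets)
    return lm_loss + lam * recon_loss  # lambda = 0.1 as reported above

B, T, V = 2, 16, 32064  # toy shapes; V matches the Phi-3.5 vocabulary size
loss = combined_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                     torch.randn(B, T, V), torch.randint(0, V, (B, T)))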
Citation
@misc{geng2025zip2zipinferencetimeadaptivetokenization,
title={zip2zip: Inference-Time Adaptive Tokenization via Online Compression},
author={Saibo Geng and Nathan Ranchin and Yunzhen Yao and Maxime Peyrard and Chris Wendler and Michael Gastpar and Robert West},
year={2025},
eprint={2506.01084},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01084},
}