Zip2Zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
Overview
Zip2Zip brings inference-time adaptive tokenization to large language models (LLMs) using online LZW token compression.
Instead of relying on a fixed vocabulary, Zip2Zip dynamically merges frequently co-occurring token sequences into hypertokens during inference.
This allows the model to adapt its vocabulary to each context, reducing the number of tokens needed and thus speeding up generation.
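To give intuition, here is a minimal LZW-style sketch over token IDs. It is illustrative only: the function name, greedy matching details, and ID assignment are simplifications, not the zip2zip implementation.

# Illustrative sketch of LZW-style token compression (not the zip2zip library code).
# Token IDs below a base vocabulary size V are ordinary tokens; new "hypertokens"
# are assigned IDs V, V+1, ... as co-occurring token sequences are merged.

def lzw_compress(token_ids, base_vocab_size, max_merge_size=3):
    codebook = {}                 # maps tuples of base-token IDs -> hypertoken ID
    next_id = base_vocab_size     # first unused hypertoken ID
    output = []
    current = []                  # longest match found so far (list of base-token IDs)

    for tok in token_ids:
        candidate = current + [tok]
        if len(candidate) == 1 or tuple(candidate) in codebook:
            current = candidate   # keep extending the match
        else:
            # emit the code for the longest known match
            output.append(codebook[tuple(current)] if len(current) > 1 else current[0])
            # register the new sequence as a hypertoken (bounded merge size)
            if len(candidate) <= max_merge_size:
                codebook[tuple(candidate)] = next_id
                next_id += 1
            current = [tok]

    if current:
        output.append(codebook[tuple(current)] if len(current) > 1 else current[0])
    return output, codebook

compressed, codebook = lzw_compress([5, 7, 5, 7, 5, 7, 9], base_vocab_size=100)
print(compressed)  # repeated (5, 7) pairs collapse into hypertoken IDs >= 100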
- Paper: zip2zip: Inference-Time Adaptive Tokenization via Online Compression (NeurIPS 2025)
- Repository: https://github.com/epfl-dlab/zip2zip
- Base model: microsoft/Phi-3.5-mini-instruct
Uptraining
This model was uptrained from microsoft/Phi-3.5-mini-instruct using parameter-efficient finetuning (LoRA) for ~10 GPU-hours.
During uptraining, the model learned to operate on LZW-compressed sequences while preserving semantic reconstruction via an auxiliary autoencoding loss.
This allows it to generate and interpret hypertokens seamlessly during inference.
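For orientation, the LoRA side of such uptraining could be configured with the peft library roughly as sketched below. The rank, alpha, and target modules are assumed values, not the authors' configuration, and the zip2zip-specific parts (compressed inputs, hyper-encoder, auxiliary loss) are not shown.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                   # rank (assumed value)
    lora_alpha=32,                          # scaling (assumed value)
    target_modules=["qkv_proj", "o_proj"],  # Phi-3 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()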
Features
- Dynamic vocabulary expansion during inference
- LZW-based online token compression for input + output
- Hyper-encoder module computes embeddings for new hypertokens on the fly (see the sketch after this list)
- Plug-and-play integration with Hugging Face Transformers
- Compatible with PEFT / LoRA / 8-bit quantization
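The hyper-encoder bullet above can be pictured with a minimal stand-in that mean-pools the embeddings of a hypertoken's constituent tokens. The actual hyper-encoder in zip2zip is a learned module; this sketch only conveys the shape of the idea, and the function name is hypothetical.

import torch

def hypertoken_embedding(base_embeddings, constituent_ids):
    # base_embeddings: (vocab_size, hidden_dim) embedding matrix of the base model
    # constituent_ids: base-token IDs that were merged into the hypertoken
    return base_embeddings[constituent_ids].mean(dim=0)

emb_matrix = torch.randn(32064, 3072)  # Phi-3.5-mini vocabulary and hidden sizes
hyper_emb = hypertoken_embedding(emb_matrix, [5, 7])
print(hyper_emb.shape)  # torch.Size([3072])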
Installation
pip install zip2zip
Usage
Same API as Hugging Face Transformers
import torch
from zip2zip import Zip2ZipModel, Zip2ZipTokenizer

pretrained_model_url = "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the zip2zip tokenizer and model (drop-in replacements for the Transformers classes)
tokenizer = Zip2ZipTokenizer.from_pretrained(pretrained_model_url)
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map=device)

inputs = tokenizer("Write a MultiHeadAttention layer in PyTorch", return_tensors="pt").to(device)
outputs = model.generate(**inputs)

# color_decode returns the decoded text with hypertokens highlighted
print(tokenizer.color_decode(outputs))
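Because the model exposes the standard Transformers generation API, the usual generation arguments should pass through unchanged. The call below is an assumption based on that "same API" claim rather than something taken from the zip2zip docs.

# Standard generation kwargs (e.g. max_new_tokens, do_sample) are assumed to work as in Transformers
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.color_decode(outputs))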
Quantized inference
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map="auto", load_in_8bit=True)
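On recent transformers versions, the load_in_8bit flag is deprecated in favor of BitsAndBytesConfig. Assuming Zip2ZipModel.from_pretrained forwards quantization_config like a standard Transformers model, the equivalent call would be:

from transformers import BitsAndBytesConfig

# 8-bit loading via BitsAndBytesConfig (requires the bitsandbytes package)
model = Zip2ZipModel.from_pretrained(
    pretrained_model_url,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)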
Training Details
- Base: Phi-3.5-mini-instruct
- Method: Continued pretraining with LoRA (parameter-efficient finetuning)
- Data: 100 M tokens (The Pile, C4, Paloma mC4/dC4)
- Objectives (combined as sketched after this list):
- Causal LM loss on LZW-compressed sequences
- Auxiliary reconstruction loss (λ = 0.1)
- Max merge size: 3
- Precision: bf16 mixed
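A minimal sketch of how the two objectives combine, with the form inferred from the bullets above and toy tensor shapes; this is not the authors' training code, and combined_loss is a hypothetical helper.

import torch
import torch.nn.functional as F

def combined_loss(lm_logits, compressed_targets, recon_logits, original_targets, lam=0.1):
    # lm_logits:    (batch, seq, vocab) next-token predictions over the compressed sequence
    # recon_logits: (batch, seq, vocab) predictions reconstructing the uncompressed tokens
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), compressed_targets)
    recon_loss = F.cross_entropy(recon_logits.transpose(1, 2), original_targets)
    return lm_loss + lam * recon_loss  # lambda = 0.1 as reported above

B, T, V = 2, 16, 32064  # toy shapes; V matches the Phi-3.5 vocabulary size
loss = combined_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                     torch.randn(B, T, V), torch.randint(0, V, (B, T)))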
Citation
@misc{geng2025zip2zipinferencetimeadaptivetokenization,
title={zip2zip: Inference-Time Adaptive Tokenization via Online Compression},
author={Saibo Geng and Nathan Ranchin and Yunzhen Yao and Maxime Peyrard and Chris Wendler and Michael Gastpar and Robert West},
year={2025},
eprint={2506.01084},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01084},
}