Zip2Zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression


Overview

Zip2Zip brings inference-time adaptive tokenization to large language models (LLMs) using online LZW token compression.
Instead of relying on a fixed vocabulary, Zip2Zip dynamically merges frequently co-occurring token sequences into hypertokens during inference.
This allows the model to adapt its vocabulary to each context, reducing the number of tokens needed and thus speeding up generation.
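
To make the mechanism concrete, here is a minimal, illustrative sketch of LZW-style compression over token IDs. The real codec lives inside Zip2ZipTokenizer; the function and parameter names here (lzw_compress, base_vocab_size, max_merge_size) are hypothetical. The idea: scan the sequence, greedily match the longest run already in the codebook, emit its (hyper)token ID, and register the match extended by one token as a new hypertoken:

# Illustrative LZW-style token compression; not the actual zip2zip codec.
def lzw_compress(token_ids, base_vocab_size, max_merge_size=3):
    codebook = {}              # tuple of base token ids -> hypertoken id
    next_id = base_vocab_size  # hypertoken ids start above the base vocab
    out, i = [], 0
    while i < len(token_ids):
        # Greedily extend the match while the longer run is in the codebook.
        match, j = (token_ids[i],), i + 1
        while j < len(token_ids) and match + (token_ids[j],) in codebook:
            match, j = match + (token_ids[j],), j + 1
        out.append(codebook.get(match, match[0]))
        # LZW rule: register match + next token as a new hypertoken,
        # capped at max_merge_size base tokens per hypertoken.
        if j < len(token_ids) and len(match) < max_merge_size:
            codebook[match + (token_ids[j],)] = next_id
            next_id += 1
        i = j
    return out, codebook

Because the codebook is rebuilt deterministically from the token stream itself, the decoder can reconstruct it online without transmitting it, which is what makes per-context vocabularies practical at inference time.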

Uptraining

This model was uptrained from microsoft/Phi-3.5-mini-instruct using parameter-efficient finetuning (LoRA) for ~10 GPU-hours.
During uptraining, the model learned to operate on LZW-compressed sequences while preserving semantic reconstruction via an auxiliary autoencoding loss.
This allows it to generate and interpret hypertokens seamlessly during inference.

Features

  • Dynamic vocabulary expansion during inference
  • LZW-based online token compression for input + output
  • Hyper-encoder module computes embeddings for new hypertokens (see the sketch after this list)
  • Plug-and-play integration with Hugging Face Transformers
  • Compatible with PEFT / LoRA / 8-bit quantization
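
Newly minted hypertokens have no entry in the model's embedding table, so the hyper-encoder must produce one from the embeddings of the constituent base tokens. The sketch below is purely illustrative: mean pooling followed by a small MLP is an assumption for exposition, not the architecture from the paper.

import torch

# Hypothetical hyper-encoder sketch: maps the embeddings of the base tokens
# that make up one hypertoken to a single new input embedding.
# Assumption: mean pooling + MLP; the real architecture is in the paper.
class HyperEncoderSketch(torch.nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_size, hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, base_embeddings):
        # base_embeddings: (merge_size, hidden_size) for one hypertoken
        return self.mlp(base_embeddings.mean(dim=0))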

Installation

pip install zip2zip

Usage

Same API as Hugging Face Transformers

import torch
from zip2zip import Zip2ZipModel, Zip2ZipTokenizer

pretrained_model_url = "epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# The tokenizer compresses the prompt into hypertokens on the fly,
# and the model generates directly over the compressed sequence.
tokenizer = Zip2ZipTokenizer.from_pretrained(pretrained_model_url)
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map=device)

inputs = tokenizer("Write a MultiHeadAttention layer in PyTorch", return_tensors="pt").to(device)
outputs = model.generate(**inputs)

# color_decode decodes the output, visually marking hypertokens.
print(tokenizer.color_decode(outputs))

Quantized inference

# 8-bit weights via bitsandbytes (requires the bitsandbytes package)
model = Zip2ZipModel.from_pretrained(pretrained_model_url, device_map="auto", load_in_8bit=True)
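
Recent transformers releases deprecate the bare load_in_8bit flag in favor of BitsAndBytesConfig. Assuming Zip2ZipModel.from_pretrained forwards keyword arguments to the underlying transformers loader, the equivalent would be:

from transformers import BitsAndBytesConfig

model = Zip2ZipModel.from_pretrained(
    pretrained_model_url,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)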

Training Details

  • Base: Phi-3.5-mini-instruct
  • Method: Continued pretraining with LoRA (parameter-efficient finetuning)
  • Data: 100 M tokens (The Pile, C4, Paloma mC4/dC4)
  • Objectives:
    • Causal LM loss on LZW-compressed sequences
    • Auxiliary reconstruction loss (λ = 0.1)
  • Max merge size: 3
  • Precision: bf16 mixed
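
Taken together, a plausible form of the combined objective (the additive mix with λ = 0.1 is read off the list above; the exact formulation is given in the paper) is:

$$\mathcal{L} = \mathcal{L}_{\text{CLM}} + \lambda \, \mathcal{L}_{\text{recon}}, \qquad \lambda = 0.1$$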

Citation

@misc{geng2025zip2zipinferencetimeadaptivetokenization,
      title={zip2zip: Inference-Time Adaptive Tokenization via Online Compression}, 
      author={Saibo Geng and Nathan Ranchin and Yunzhen Yao and Maxime Peyrard and Chris Wendler and Michael Gastpar and Robert West},
      year={2025},
      eprint={2506.01084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01084}, 
}