	# Scaling Image Tokenizers with Grouped Spherical Quantization
	---

	[Paper link](https://arxiv.org/abs/2412.02632) \| [GitHub repo](https://github.com/HelmholtzAI-FZJ/flex_gen) \| [HF checkpoints](https://huggingface.co/collections/HelmholtzAI-FZJ/grouped-spherical-quantization-674d6f9f548e472d0eaf179e)

	In [GSQ](https://arxiv.org/abs/2412.02632), we present optimized training hyper-parameters and configurations for quantization-based image tokenizers. We also show how to scale the latent dimension, vocabulary size, and related settings to achieve better reconstruction performance.

	![dim-vocab-scaling.png](https://github.com/HelmholtzAI-FZJ/flex_gen/raw/main/figures/dim-vocab-scaling.png)

	We also show how to scale the latent dimension (and the number of groups) appropriately when pursuing a high down-sampling ratio for compression.

	![spatial_scale.png](https://github.com/HelmholtzAI-FZJ/flex_gen/raw/main/figures/spatial_scale.png)
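To make the role of the groups concrete, here is a minimal pure-Python sketch of grouped spherical quantization as we read it from the method name: a latent vector with D channels is split into G groups of d = D / G channels, each group is L2-normalized onto the unit sphere, and it is replaced by the most similar unit-norm codebook entry. The function names, the random toy codebook, and the choice to share one codebook across all groups are illustrative assumptions, not the `flex_gen` implementation.

```python
import math
import random


def l2_normalize(vec):
    """Project a vector onto the unit sphere (the zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def grouped_spherical_quantize(latent, codebook, num_groups):
    """Quantize one latent vector group by group.

    latent:   list of D floats
    codebook: list of V unit-norm vectors of length d = D / num_groups
              (assumed here to be shared by all groups)
    Returns (indices, quantized) with one code index per group.
    """
    if len(latent) % num_groups != 0:
        raise ValueError("latent dimension must be divisible by num_groups")
    d = len(latent) // num_groups
    indices, quantized = [], []
    for g in range(num_groups):
        group = l2_normalize(latent[g * d:(g + 1) * d])
        # On the unit sphere, the nearest code is the one with the largest dot product.
        sims = [sum(a * b for a, b in zip(group, code)) for code in codebook]
        best = max(range(len(codebook)), key=sims.__getitem__)
        indices.append(best)
        quantized.extend(codebook[best])
    return indices, quantized


random.seed(0)
D, G, V = 16, 4, 32  # toy sizes, far smaller than the real models
codebook = [l2_normalize([random.gauss(0, 1) for _ in range(D // G)]) for _ in range(V)]
latent = [random.gauss(0, 1) for _ in range(D)]
indices, quantized = grouped_spherical_quantize(latent, codebook, G)
print(indices)  # one code index per group
```

With G = 1 this reduces to quantizing the whole latent with a single code; larger G keeps D fixed but represents each latent with G indices.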

	The group scaling experiment of GSQ:

	---

	| Models | $G \times d$ | rFID ↓ | IS ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | Usage ↑ | PPL ↑ |
	|---|---|---|---|---|---|---|---|---|
	| GSQ F8-D64, $V=8K$ | $1 \times 64$ | 0.63 | 205 | 0.08 | 22.95 | 0.67 | 99.87% | 8,055 |
	| | $2 \times 32$ | 0.32 | 220 | 0.05 | 25.42 | 0.76 | 100% | 8,157 |
	| | $4 \times 16$ | 0.18 | 226 | 0.03 | 28.02 | 0.08 | 100% | 8,143 |
	| | $16 \times 4$ | 0.03 | 233 | 0.004 | 34.61 | 0.91 | 99.98% | 6,775 |
	| GSQ F16-D16, $V=256K$ | $1 \times 16$ | 1.63 | 179 | 0.13 | 20.70 | 0.56 | 100% | 254,044 |
	| | $2 \times 8$ | 0.82 | 199 | 0.09 | 22.20 | 0.63 | 100% | 257,273 |
	| | $4 \times 4$ | 0.74 | 202 | 0.08 | 22.75 | 0.63 | 62.46% | 43,767 |
	| | $8 \times 2$ | 0.50 | 211 | 0.06 | 23.62 | 0.66 | 46.83% | 22,181 |
	| | $16 \times 1$ | 0.52 | 210 | 0.06 | 23.54 | 0.66 | 50.81% | 181 |
	| | $16 \times 1^*$ | 0.51 | 210 | 0.06 | 23.52 | 0.66 | 52.64% | 748 |
	| GSQ F32-D32, $V=256K$ | $1 \times 32$ | 6.84 | 95 | 0.24 | 17.83 | 0.40 | 100% | 245,715 |
	| | $2 \times 16$ | 3.31 | 139 | 0.18 | 19.01 | 0.47 | 100% | 253,369 |
	| | $4 \times 8$ | 1.77 | 173 | 0.13 | 20.60 | 0.53 | 100% | 253,199 |
	| | $8 \times 4$ | 1.67 | 176 | 0.12 | 20.88 | 0.54 | 59% | 40,307 |
	| | $16 \times 2$ | 1.13 | 190 | 0.10 | 21.73 | 0.57 | 46% | 30,302 |
	| | $32 \times 1$ | 1.21 | 187 | 0.10 | 21.64 | 0.57 | 54% | 247 |

	---
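The Usage and PPL columns describe how evenly the codebook is used. As a minimal sketch, and assuming the common definitions (Usage is the fraction of codes selected at least once, PPL is the exponential of the entropy of the empirical code distribution, so it cannot exceed the vocabulary size), they can be computed from a flat list of token indices as follows. The helper name is hypothetical, and the evaluation script in the repository may compute these statistics differently.

```python
import math
from collections import Counter


def codebook_stats(indices, vocab_size):
    """Return (usage, perplexity) for a flat list of code indices."""
    counts = Counter(indices)
    total = len(indices)
    # Fraction of the vocabulary that was selected at least once.
    usage = len(counts) / vocab_size
    # Entropy (in nats) of the empirical distribution over the codes that occur.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return usage, math.exp(entropy)


# Toy example: all 8 codes are used equally often.
usage, ppl = codebook_stats(list(range(8)) * 4, vocab_size=8)
print(f"usage={usage:.2%} ppl={ppl:.2f}")
```

Under these definitions, a perfectly uniform codebook reaches a perplexity equal to its vocabulary size, while a collapsed codebook that always emits the same code has a perplexity of 1.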


	## Use Pre-trained GSQ-Tokenizer

	```python
	# Imported for its side effect: this is presumably what makes
	# 'flexTokenizer' known to timm's create_model.
	from flex_gen import autoencoders  # noqa: F401
	from timm import create_model

	# ============= From the Hugging Face repo
	model = create_model(
	    'flexTokenizer',
	    pretrained=True,
	    repo_id='HelmholtzAI-FZJ/GSQ-F8-D8-V64k',
	)

	# ============= From a local checkpoint
	model = create_model(
	    'flexTokenizer',
	    pretrained=True,
	    path='PATH/your_checkpoint.pt',
	)
	```
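The checkpoint names appear to encode the tokenizer configuration: in `GSQ-F8-D8-V64k`, we read `F8` as the spatial down-sampling factor, `D8` as the latent dimension, and `V64k` as the codebook size. The helper below is a hypothetical convenience (not part of `flex_gen`) that parses such a name and derives the token grid for a 256x256 input, a resolution chosen here only as an example. It assumes that `k` denotes 1024.

```python
import re
from dataclasses import dataclass


@dataclass
class TokenizerSpec:
    downsample: int  # F: spatial down-sampling factor
    latent_dim: int  # D: channels per latent token
    vocab_size: int  # V: number of codebook entries


def parse_gsq_name(name):
    """Parse names such as 'GSQ-F8-D8-V64k' (hypothetical helper)."""
    match = re.search(r"F(\d+)-D(\d+)-V(\d+)([kK]?)", name)
    if match is None:
        raise ValueError(f"unrecognized tokenizer name: {name!r}")
    f, d, v, suffix = match.groups()
    scale = 1024 if suffix else 1  # assumption: 'k' denotes 1024
    return TokenizerSpec(int(f), int(d), int(v) * scale)


spec = parse_gsq_name("HelmholtzAI-FZJ/GSQ-F8-D8-V64k")
side = 256 // spec.downsample  # tokens per side for a 256x256 input
print(spec, f"-> {side}x{side} = {side * side} tokens")
```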

	---

	## Training your tokenizer

	### Set-up Python Virtual Environment

	```bash
	# Set up and activate the virtual environment
	sh gen_env/setup.sh
	source ./gen_env/activate.sh

	# This runs pip install to download all required libraries
	sh ./gen_env/install_requirements.sh
	```

	### Run Training

	```bash
	# Single GPU
	python -W ignore ./scripts/train_autoencoder.py

	# Multi GPU
	torchrun --nnodes=1 --nproc_per_node=4 ./scripts/train_autoencoder.py --config-file=PATH/config_name.yaml \
	--output_dir=./logs_test/test opts train.num_train_steps=100 train_batch_size=16
	```
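The trailing `opts train.num_train_steps=100 train_batch_size=16` arguments appear to override values from the YAML config. How the training script parses them is not shown here; purely as an illustration, dotted `key=value` overrides of this kind can be applied to a nested config dictionary as sketched below. The function name, the literal-parsing rule, and the starting config values are assumptions.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict in place and return it."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, lists, ...
        except (ValueError, SyntaxError):
            value = raw  # anything else is kept as a plain string
        node = config
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            # Walk down the nested dict, creating missing levels on the way.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Toy config; the real YAML file has many more entries.
config = {"train": {"num_train_steps": 100000}, "train_batch_size": 64}
apply_overrides(config, ["train.num_train_steps=100", "train_batch_size=16"])
print(config)
```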

	### Run Evaluation

	Add the checkpoint path that you want to test to `evaluation/run_tokenizer_eval.sh`:

	```bash
	# For example
	...
	configs_of_training_lists=()
	configs_of_training_lists=("logs_test/test/")
	...
	```

	Then run `sh evaluation/run_tokenizer_eval.sh`. It will automatically scan `folder/model/eval_xxx.pth` for tokenizer evaluation.
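To check which checkpoints such a scan would pick up before launching the evaluation, a short Python sketch can list them. It assumes that each training folder contains a `model/` sub-directory with checkpoints named `eval_*.pth`, which is our reading of the `folder/model/eval_xxx.pth` pattern; the helper name is hypothetical and the shell script itself may behave differently.

```python
from pathlib import Path


def find_eval_checkpoints(training_dirs):
    """Return all 'model/eval_*.pth' files under the given training folders."""
    checkpoints = []
    for folder in training_dirs:
        # Sort within each folder so the listing is deterministic.
        checkpoints.extend(sorted(Path(folder).glob("model/eval_*.pth")))
    return checkpoints


for ckpt in find_eval_checkpoints(["logs_test/test/"]):
    print(ckpt)
```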

	---

	## Citation

	```bibtex
	@misc{GSQ,
	title={Scaling Image Tokenizers with Grouped Spherical Quantization},
	author={Jiangtao Wang and Zhen Qin and Yifan Zhang and Vincent Tao Hu and Björn Ommer and Rania Briq and Stefan Kesselheim},
	year={2024},
	eprint={2412.02632},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2412.02632},
	}
	```