Nanosaur
Release v1.0
Nanosaur is a 542M-parameter text-to-image model for illustrations. It consists of two trained components:
- PS-VAE: A VAE that compresses DINOv3's representation while retaining its semantic content
- DeCo: A diffusion transformer with a wide per-patch MLP head, similar to DDT
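At generation time these components are chained: Gemma 3 text embeddings condition DeCo, which denoises a latent that the PS-VAE decoder maps back to pixels. A shape-only sketch of that flow (dimensions taken from the checkpoint table below; module names, token count, and embedding width are illustrative assumptions, not the repo's API):

```python
# Shape-only sketch of the Nanosaur generation pipeline; the sampler calls
# in the comments are illustrative, not the repo's actual API.
import torch

B, H, W = 1, 1024, 1024                 # batch size and target resolution
C_LAT, STRIDE = 96, 16                  # latent channels / spatial compression of the PS-VAE
h, w = H // STRIDE, W // STRIDE         # 64 x 64 latent grid

text_emb = torch.randn(B, 64, 640)      # stand-in for Gemma 3 token embeddings (shape assumed)
z = torch.randn(B, C_LAT, h, w)         # start from Gaussian noise in PS-VAE latent space

# z_clean = deco.sample(z, text_emb)    # DeCo flow-matching sampler -> (B, 96, 64, 64)
# image   = psvae.decode(z_clean)       # PS-VAE decoder             -> (B, 3, 1024, 1024)
```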
This release was trained from scratch in 8.5 days on a single GPU, setting a new standard for compute efficiency in T2I illustration training.
This model is intended for research purposes. It is not a general-purpose T2I model, and it is not a fine-tune of an existing model; do not expect visual quality to match models from large corporate labs.
Prompt format: tags or natural language
This repo includes a Gradio GUI for image generation, a tech report, and training scripts for both the VAE and the diffusion model.
Included Checkpoints
| Checkpoint | Path | Parameters | Description |
|---|---|---|---|
| PS-VAE | vae/nanosaur_psvae_v1.0.safetensors | 150M | VAE with DINOv3 encoder, 96-channel latent space, 16x spatial compression |
| DeCo | diffusion_model/nanosaur_deco_v1.0.safetensors | 542M | Diffusion transformer with SPRINT and x-prediction |
Text Encoder: Google Gemma 3 270M (downloaded from Hugging Face; you may need to agree to their terms to access the repo)
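Both checkpoints are plain safetensors files and can be inspected without any repo code; a minimal sanity check using the safetensors library (parameter counts should roughly match the table above):

```python
# Load a released checkpoint and count its parameters with the safetensors library.
from safetensors.torch import load_file

state_dict = load_file("vae/nanosaur_psvae_v1.0.safetensors")
n_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")  # ~150M expected for the PS-VAE
```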
Model Architecture and Training Details
See TECH REPORT.md
Requirements
- Python 3.12+
- uv
- CUDA-capable GPU (6 GB VRAM for 1024x1024 inference; 24 GB VRAM recommended for training)
Install dependencies:
uv sync
Quick Start
Image Generation (Inference)
Launch the Gradio web interface:
uv run gradio_app.py
This provides a web UI at http://localhost:7860 for text-to-image generation.
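If you prefer scripting over the browser UI, the standard gradio_client package can drive the same local server. The endpoint names and argument order depend on how gradio_app.py defines its interface, so list them first; the predict call below is illustrative only:

```python
# Drive the locally running Gradio app from Python (server started with
# `uv run gradio_app.py`). Endpoint names are app-specific; check view_api() first.
from gradio_client import Client

client = Client("http://localhost:7860")
client.view_api()  # prints the available endpoints and their parameters
# result = client.predict("a fox reading a book, watercolor", api_name="/generate")  # illustrative only
```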
LoRA Training
uv run train_lora/cache_lora_dataset.py --data_dir=/path/to/images_and_textfiles --cache_dir=cache_lora
uv run train_lora/train_lora.py --cache_dir=cache_lora
uv run gradio_app.py
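The caching step reads image/caption pairs from --data_dir. Based on the directory name in the command, a plausible convention is one .txt caption per image with the same stem; a quick check for that pairing (the convention itself is an assumption):

```python
# Sanity-check a LoRA dataset directory: every image should have a matching
# caption file with the same stem (assumed convention, e.g. cat.png + cat.txt).
from pathlib import Path

data_dir = Path("/path/to/images_and_textfiles")
image_exts = {".png", ".jpg", ".jpeg", ".webp"}

for img in sorted(p for p in data_dir.iterdir() if p.suffix.lower() in image_exts):
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption for {img.name}")
```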
VAE Training
The PS-VAE is trained in two stages:
- S-VAE Stage: Train semantic encoder/decoder with frozen DINOv3
- PS-VAE Stage: Fine-tune the full model including DINOv3 encoder
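The practical difference between the two stages is which parameters receive gradients; a minimal PyTorch sketch of that switch (the vae.dinov3 attribute name is an assumption, not necessarily the repo's):

```python
# Stage switch for PS-VAE training: the DINOv3 encoder is frozen in the
# S-VAE stage and unfrozen for PS-VAE fine-tuning. `vae.dinov3` is an
# assumed attribute name.
def set_stage(vae, stage: str):
    train_encoder = (stage == "psvae")
    for p in vae.dinov3.parameters():
        p.requires_grad_(train_encoder)
```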
Train VAE
uv run train_vae/train_vae_afhq.py --stage both
Resume training:
uv run train_vae/train_vae_afhq.py --stage psvae --resume checkpoints_vae/svae_final.pt
VAE Inference
Reconstruct an image through the VAE:
uv run vae/inference_vae.py image.png --width 256 --height 256
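To put a number on reconstruction quality, you can compare the input against the reconstructed file with PSNR; the reconstruction filename below is an assumption, so substitute whatever path the script reports:

```python
# Compare an input image to its VAE reconstruction via PSNR.
# "image_recon.png" is an assumed output name; use the file the script writes.
import numpy as np
from PIL import Image

orig = np.asarray(Image.open("image.png").convert("RGB").resize((256, 256)), dtype=np.float64)
recon = np.asarray(Image.open("image_recon.png").convert("RGB"), dtype=np.float64)

mse = np.mean((orig - recon) ** 2)
psnr = 10 * np.log10(255.0 ** 2 / mse)
print(f"PSNR: {psnr:.2f} dB")
```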
Diffusion Model Training
Step 1: Cache Dataset
Pre-compute VAE latents and text embeddings for faster training:
uv run train_diffusion_model/cache_diffusion_afhq.py
This creates a cache/ directory with pre-encoded image latents and text embeddings.
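Conceptually, caching runs both frozen encoders once per sample and stores the results so the training loop never touches raw images or text; a schematic sketch (function and attribute names are assumptions):

```python
# Schematic of dataset caching: encode each image with the PS-VAE encoder and
# each caption with Gemma 3, then save the pair. Names here are illustrative.
import torch

@torch.no_grad()
def cache_sample(vae, text_encoder, image, caption, out_path):
    latent = vae.encode(image)              # (96, H/16, W/16) latent
    text_emb = text_encoder(caption)        # Gemma 3 token embeddings
    torch.save({"latent": latent.cpu(), "text_emb": text_emb.cpu()}, out_path)
```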
Step 2: Train Diffusion Model
uv run train_diffusion_model/train_diffusion_afhq.py
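Per the credits below, training combines flow matching (minRF-style) with x-prediction and a velocity-space loss (JiT). Under one common rectified-flow convention, a training step looks roughly like the sketch below; the exact formulation is in TECH REPORT.md, so treat this as illustrative:

```python
# Rough sketch of one flow-matching training step with x-prediction and a
# velocity-space loss, under the convention z_t = (1 - t) * x + t * noise.
# The model signature and exact loss are assumptions; see TECH REPORT.md.
import torch
import torch.nn.functional as F

def training_step(model, x, text_emb):
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)   # timestep in (0, 1)
    noise = torch.randn_like(x)
    z_t = (1 - t) * x + t * noise                          # interpolate clean latent and noise

    x_hat = model(z_t, t.flatten(), text_emb)              # x-prediction: model outputs the clean latent
    v_hat = (z_t - x_hat) / t.clamp_min(1e-4)              # velocity implied by the x-prediction
    v_target = noise - x                                   # true velocity for this convention
    return F.mse_loss(v_hat, v_target)
```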
Monitor training:
tensorboard --logdir runs_diffusion
Credits & References
| Component | Source |
|---|---|
| VAE Encoder | Initialized from timm/vit_base_patch16_dinov3.lvd1689m |
| VAE Decoder | Trained from scratch, based on VA-VAE |
| Text Encoder | Google Gemma 3 270M |
| Flow Matching | Based on minRF |
| SIGReg Loss | From LeJEPA |
- VAE Training Recipe: PS-VAE
- DeCo Architecture: DeCo
- SPRINT: SPRINT (implementation code from SpeedrunDiT)
- X-prediction with V-loss: JiT
- DINOv3: Self-supervised Vision Transformers (Meta AI)