Nanosaur
Release v1.0
Nanosaur is a 542M-parameter text-to-image model for illustrations. It consists of two trained components:
- PS-VAE: A VAE that compresses DINOv3's representation while retaining its semantic content
- DeCo: A diffusion transformer with a wide per-patch MLP head, similar to DDT
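At generation time these components are chained: Gemma 3 text embeddings condition DeCo, which denoises a latent that the PS-VAE decoder maps back to pixels. A shape-only sketch of that flow (dimensions taken from the checkpoint table below; module names, token count, and embedding width are illustrative assumptions, not the repo's API):

```python
# Shape-only sketch of the Nanosaur generation pipeline; the sampler calls
# in the comments are illustrative, not the repo's actual API.
import torch

B, H, W = 1, 1024, 1024                 # batch size and target resolution
C_LAT, STRIDE = 96, 16                  # latent channels / spatial compression of the PS-VAE
h, w = H // STRIDE, W // STRIDE         # 64 x 64 latent grid

text_emb = torch.randn(B, 64, 640)      # stand-in for Gemma 3 token embeddings (shape assumed)
z = torch.randn(B, C_LAT, h, w)         # start from Gaussian noise in PS-VAE latent space

# z_clean = deco.sample(z, text_emb)    # DeCo flow-matching sampler -> (B, 96, 64, 64)
# image   = psvae.decode(z_clean)       # PS-VAE decoder             -> (B, 3, 1024, 1024)
```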
This release was trained from scratch in 8.5 days on a single GPU, setting a new standard for compute efficiency in T2I illustration training.
This model is intended for research purposes. It is not a general-purpose T2I model, and it is not a fine-tune of an existing model; do not expect visual quality to match models from large corporate labs.
Prompt format: tags or natural language
This repo includes a Gradio GUI for image generation, a tech report, and training scripts for both the VAE and the diffusion model.
Included Checkpoints
| Checkpoint | Path | Parameters | Description |
|---|---|---|---|
| PS-VAE | vae/nanosaur_psvae_v1.0.safetensors | 150M | VAE with DINOv3 encoder, 96-channel latent space, 16x spatial compression |
| DeCo | diffusion_model/nanosaur_deco_v1.0.safetensors | 542M | Diffusion transformer with SPRINT and x-prediction |
Text Encoder: Google Gemma 3 270M (downloaded from Hugging Face; you may need to agree to their terms to access the repo)
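Both checkpoints are plain safetensors files and can be inspected without any repo code; a minimal sanity check using the safetensors library (parameter counts should roughly match the table above):

```python
# Load a released checkpoint and count its parameters with the safetensors library.
from safetensors.torch import load_file

state_dict = load_file("vae/nanosaur_psvae_v1.0.safetensors")
n_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")  # ~150M expected for the PS-VAE
```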
Model Architecture and Training Details
See TECH REPORT.md
Requirements
- Python 3.12+
- uv
- CUDA-capable GPU (6 GB VRAM for 1024x1024 inference; 24 GB VRAM recommended for training)
Install dependencies:
uv sync
Quick Start
Image Generation (Inference)
Launch the Gradio web interface:
uv run gradio_app.py
This provides a web UI at http://localhost:7860 for text-to-image generation.
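If you prefer scripting over the browser UI, the standard gradio_client package can drive the same local server. The endpoint names and argument order depend on how gradio_app.py defines its interface, so list them first; the predict call below is illustrative only:

```python
# Drive the locally running Gradio app from Python (server started with
# `uv run gradio_app.py`). Endpoint names are app-specific; check view_api() first.
from gradio_client import Client

client = Client("http://localhost:7860")
client.view_api()  # prints the available endpoints and their parameters
# result = client.predict("a fox reading a book, watercolor", api_name="/generate")  # illustrative only
```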
LoRA Training
uv run train_lora/cache_lora_dataset.py --data_dir=/path/to/images_and_textfiles --cache_dir=cache_lora
uv run train_lora/train_lora.py --cache_dir=cache_lora
uv run gradio_app.py
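The caching step reads image/caption pairs from --data_dir. Based on the directory name in the command, a plausible convention is one .txt caption per image with the same stem; a quick check for that pairing (the convention itself is an assumption):

```python
# Sanity-check a LoRA dataset directory: every image should have a matching
# caption file with the same stem (assumed convention, e.g. cat.png + cat.txt).
from pathlib import Path

data_dir = Path("/path/to/images_and_textfiles")
image_exts = {".png", ".jpg", ".jpeg", ".webp"}

for img in sorted(p for p in data_dir.iterdir() if p.suffix.lower() in image_exts):
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption for {img.name}")
```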
VAE Training
The PS-VAE is trained in two stages:
- S-VAE Stage: Train semantic encoder/decoder with frozen DINOv3
- PS-VAE Stage: Fine-tune the full model including DINOv3 encoder
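The practical difference between the two stages is which parameters receive gradients; a minimal PyTorch sketch of that switch (the vae.dinov3 attribute name is an assumption, not necessarily the repo's):

```python
# Stage switch for PS-VAE training: the DINOv3 encoder is frozen in the
# S-VAE stage and unfrozen for PS-VAE fine-tuning. `vae.dinov3` is an
# assumed attribute name.
def set_stage(vae, stage: str):
    train_encoder = (stage == "psvae")
    for p in vae.dinov3.parameters():
        p.requires_grad_(train_encoder)
```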
Train VAE
uv run train_vae/train_vae_afhq.py --stage both
Resume training:
uv run train_vae/train_vae_afhq.py --stage psvae --resume checkpoints_vae/svae_final.pt
VAE Inference
Reconstruct an image through the VAE:
uv run vae/inference_vae.py image.png --width 256 --height 256
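To put a number on reconstruction quality, you can compare the input against the reconstructed file with PSNR; the reconstruction filename below is an assumption, so substitute whatever path the script reports:

```python
# Compare an input image to its VAE reconstruction via PSNR.
# "image_recon.png" is an assumed output name; use the file the script writes.
import numpy as np
from PIL import Image

orig = np.asarray(Image.open("image.png").convert("RGB").resize((256, 256)), dtype=np.float64)
recon = np.asarray(Image.open("image_recon.png").convert("RGB"), dtype=np.float64)

mse = np.mean((orig - recon) ** 2)
psnr = 10 * np.log10(255.0 ** 2 / mse)
print(f"PSNR: {psnr:.2f} dB")
```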
Diffusion Model Training
Step 1: Cache Dataset
Pre-compute VAE latents and text embeddings for faster training:
uv run train_diffusion_model/cache_diffusion_afhq.py
This creates a cache/ directory with pre-encoded image latents and text embeddings.
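Conceptually, caching runs both frozen encoders once per sample and stores the results so the training loop never touches raw images or text; a schematic sketch (function and attribute names are assumptions):

```python
# Schematic of dataset caching: encode each image with the PS-VAE encoder and
# each caption with Gemma 3, then save the pair. Names here are illustrative.
import torch

@torch.no_grad()
def cache_sample(vae, text_encoder, image, caption, out_path):
    latent = vae.encode(image)              # (96, H/16, W/16) latent
    text_emb = text_encoder(caption)        # Gemma 3 token embeddings
    torch.save({"latent": latent.cpu(), "text_emb": text_emb.cpu()}, out_path)
```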
Step 2: Train Diffusion Model
uv run train_diffusion_model/train_diffusion_afhq.py
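Per the credits below, training combines flow matching (minRF-style) with x-prediction and a velocity-space loss (JiT). Under one common rectified-flow convention, a training step looks roughly like the sketch below; the exact formulation is in TECH REPORT.md, so treat this as illustrative:

```python
# Rough sketch of one flow-matching training step with x-prediction and a
# velocity-space loss, under the convention z_t = (1 - t) * x + t * noise.
# The model signature and exact loss are assumptions; see TECH REPORT.md.
import torch
import torch.nn.functional as F

def training_step(model, x, text_emb):
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)   # timestep in (0, 1)
    noise = torch.randn_like(x)
    z_t = (1 - t) * x + t * noise                          # interpolate clean latent and noise

    x_hat = model(z_t, t.flatten(), text_emb)              # x-prediction: model outputs the clean latent
    v_hat = (z_t - x_hat) / t.clamp_min(1e-4)              # velocity implied by the x-prediction
    v_target = noise - x                                   # true velocity for this convention
    return F.mse_loss(v_hat, v_target)
```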
Monitor training:
tensorboard --logdir runs_diffusion
Credits & References
| Component | Source |
|---|---|
| VAE Encoder | Initialized from timm/vit_base_patch16_dinov3.lvd1689m |
| VAE Decoder | Trained from scratch, based on VA-VAE |
| Text Encoder | Google Gemma 3 270M |
| Flow Matching | Based on minRF |
| SIGReg Loss | From LeJEPA |
- VAE Training Recipe: PS-VAE
- DeCo Architecture: DeCo
- SPRINT: SPRINT (implementation code from SpeedrunDiT)
- X-prediction with V-loss: JiT
- DINOv3: Self-supervised Vision Transformers (Meta AI)