Nanosaur

Release v1.0

Nanosaur is a 542M parameter text-to-image model for illustrations. It consists of two trained components:

  1. PS-VAE: A VAE that compresses DINOv3's representation while retaining the semantic content
  2. DeCo: A diffusion transformer with a wide MLP head per patch similar to DDT
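At generation time these pieces compose into a standard latent-diffusion pipeline together with the text encoder listed under Included Checkpoints. The sketch below is conceptual only; the function and method names are hypothetical, and gradio_app.py is the actual entry point.

# Conceptual generation pipeline (hypothetical names; see gradio_app.py for the real API).
def generate(prompt, text_encoder, deco, psvae, steps=20):
    cond = text_encoder.encode(prompt)        # Gemma 3 270M text embedding
    latent = deco.sample(cond, steps=steps)   # DeCo denoises in the PS-VAE latent space
    return psvae.decode(latent)               # PS-VAE decodes latents back to pixels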

This release was trained from scratch in 8.5 days on a single GPU, setting a new standard for compute efficiency in T2I illustration training.

This model is intended for research purposes. It is not a general-purpose T2I model, nor a finetune of an existing model. Do not expect visual quality to match corporate models.

Prompt format: tags or natural language

This repo includes a Gradio GUI for image generation, a tech report, and training scripts for both the VAE and the diffusion model.

Included Checkpoints

Checkpoint   Path                                             Parameters   Description
PS-VAE       vae/nanosaur_psvae_v1.0.safetensors              150M         VAE with DINOv3 encoder, 96-channel latent space, 16x spatial compression
DeCo         diffusion_model/nanosaur_deco_v1.0.safetensors   542M         Diffusion transformer with SPRINT and x-prediction

Text Encoder: Google Gemma 3 270M (downloaded from Hugging Face; you may need to agree to their terms to access the repo)
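For intuition, the PS-VAE settings above (16x spatial compression, 96 latent channels) imply the following latent shapes; this is back-of-envelope arithmetic, not code from the repo:

# Latent shape implied by 16x spatial compression and 96 channels (back-of-envelope).
def latent_shape(height, width, compression=16, channels=96):
    return (channels, height // compression, width // compression)

print(latent_shape(1024, 1024))  # (96, 64, 64)
print(latent_shape(256, 256))    # (96, 16, 16)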

Model Architecture and Training Details

See TECH REPORT.md

Requirements

  • Python 3.12+
  • uv
  • CUDA-capable GPU (6 GB VRAM for 1024x1024 inference; 24 GB recommended for training)

Install dependencies:

uv sync

Quick Start

Image Generation (Inference)

Launch the Gradio web interface:

uv run gradio_app.py

This provides a web UI at http://localhost:7860 for text-to-image generation.
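If you prefer to drive the app programmatically, the generic gradio_client pattern below works against the running server. The app's actual endpoint names and parameters are defined in gradio_app.py, so inspect them before calling predict():

# Probe the running Gradio app's API (requires the gradio_client package).
# Endpoint names and arguments depend on gradio_app.py, so list them first.
from gradio_client import Client

client = Client("http://localhost:7860")
client.view_api()  # prints the available endpoints and their parameters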


LoRA Training

uv run train_lora/cache_lora_dataset.py --data_dir=/path/to/images_and_textfiles --cache_dir=cache_lora
uv run train_lora/train_lora.py --cache_dir=cache_lora
uv run gradio_app.py
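The --data_dir is expected to contain images alongside caption text files. A minimal sketch of that pairing convention follows; the actual loader lives in train_lora/cache_lora_dataset.py and may differ in details such as the accepted extensions:

# Sketch of the image/caption pairing convention assumed for --data_dir
# (illustrative only; see train_lora/cache_lora_dataset.py for the real loader).
from pathlib import Path

def find_pairs(data_dir):
    pairs = []
    for img in sorted(Path(data_dir).iterdir()):
        if img.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
            caption = img.with_suffix(".txt")
            if caption.exists():
                pairs.append((img, caption.read_text().strip()))
    return pairs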

VAE Training

The PS-VAE is trained in two stages:

  1. S-VAE Stage: Train semantic encoder/decoder with frozen DINOv3
  2. PS-VAE Stage: Fine-tune the full model including DINOv3 encoder
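Conceptually, the difference between the two stages is just which parameters receive gradients. A minimal sketch, with a hypothetical attribute name; the real logic is in train_vae/train_vae_afhq.py:

# Two-stage schedule, conceptually: DINOv3 is frozen in the S-VAE stage and
# unfrozen for PS-VAE fine-tuning (the attribute name below is hypothetical).
def configure_stage(model, stage):
    train_dino = (stage == "psvae")
    for p in model.dinov3_encoder.parameters():
        p.requires_grad = train_dino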

Train VAE

uv run train_vae/train_vae_afhq.py --stage both

Resume training:

uv run train_vae/train_vae_afhq.py --stage psvae --resume checkpoints_vae/svae_final.pt

VAE Inference

Reconstruct an image through the VAE:

uv run vae/inference_vae.py image.png --width 256 --height 256
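To quantify how faithful the reconstruction is, you can compare it against the original with a quick PSNR check. This is a generic utility, not part of this repo, and assumes both images have the same resolution:

# Quick PSNR between an original and its VAE reconstruction
# (generic utility, not part of this repo; both images must be the same size).
import numpy as np
from PIL import Image

def psnr(original_path, reconstruction_path):
    a = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.float32)
    b = np.asarray(Image.open(reconstruction_path).convert("RGB"), dtype=np.float32)
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 20 * np.log10(255.0 / np.sqrt(mse))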

Diffusion Model Training

Step 1: Cache Dataset

Pre-compute VAE latents and text embeddings for faster training:

uv run train_diffusion_model/cache_diffusion_afhq.py

This creates cache/ with pre-encoded image latents and text embeddings.
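Conceptually, caching means running each image and caption through the frozen encoders once and saving the results, so training never re-encodes them. A minimal sketch with hypothetical method names; the real logic is in train_diffusion_model/cache_diffusion_afhq.py:

# One cached example: encode once with the frozen VAE and text encoder, save to disk
# (hypothetical method names; see train_diffusion_model/cache_diffusion_afhq.py).
import torch

def cache_example(vae, text_encoder, image_tensor, caption, out_path):
    with torch.no_grad():
        latent = vae.encode(image_tensor)         # hypothetical API
        text_emb = text_encoder.encode(caption)   # hypothetical API
    torch.save({"latent": latent.cpu(), "text_emb": text_emb.cpu()}, out_path)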

Step 2: Train Diffusion Model

uv run train_diffusion_model/train_diffusion_afhq.py

Monitor training:

tensorboard --logdir runs_diffusion

Credits & References

Component                  Source
VAE Encoder                Initialized from timm/vit_base_patch16_dinov3.lvd1689m
VAE Decoder                Trained from scratch, based on VA-VAE
Text Encoder               Google Gemma 3 270M
Flow Matching              Based on minRF
SIGReg Loss                From LeJEPA
VAE Training Recipe        PS-VAE
DeCo Architecture          DeCo
SPRINT                     SPRINT (implementation code from SpeedrunDiT)
X-prediction with V-loss   JiT
DINOv3                     Self-supervised Vision Transformers (Meta AI)