---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---
# AMD Nitro-E

## Introduction
Nitro-E is a family of text-to-image diffusion models focused on highly efficient training. With just 304M parameters, Nitro-E is designed to be resource-friendly for both training and inference. Training from scratch takes only 1.5 days on a single node with 8 AMD Instinct™ MI300X GPUs. On the inference side, Nitro-E delivers a throughput of 18.8 samples per second (batch size 32, 512px images) on a single AMD Instinct MI300X GPU, and the distilled version raises this to 39.3 samples per second. The release consists of:

* [Nitro-E-512px](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px.safetensors): an EMMDiT-based 20-step model trained from scratch.
* [Nitro-E-512px-dist](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px-dist.safetensors): an EMMDiT-based model distilled from Nitro-E-512px.
* [Nitro-E-512px-GRPO](https://huggingface.co/amd/Nitro-E/tree/main/ckpt_grpo_512px): a model post-trained from Nitro-E-512px with the Group Relative Policy Optimization (GRPO) strategy.

⚡️ [Open-source code](https://github.com/AMD-AGI/Nitro-E)!

⚡️ [Technical blog](https://rocm.blogs.amd.com/artificial-intelligence/nitro-e/README.html)!
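If you only need a checkpoint file itself (for example, to load it with your own code), it can be fetched from the Hub with `huggingface_hub`. This is an optional sketch; the Quickstart below downloads everything through `init_pipe`, and the file name is simply one of the checkpoints listed above.

```python
# Optional: fetch a single checkpoint file from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="amd/Nitro-E", filename="Nitro-E-512px.safetensors")
print(ckpt_path)  # local path of the cached checkpoint
```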
## Details
* **Model architecture**: We propose the Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis under low training resources. Our design philosophy centers on token reduction, since computational cost scales significantly with token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module to compress tokens further. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost (a minimal sketch of the subregion-attention idea appears after this list). In addition, we propose AdaLN-affine, an efficient and lightweight module for computing modulation parameters in transformer blocks. See our technical blog post for more details.
* **Dataset**: Our models were trained on a dataset of ~25M images consisting of both real and synthetic data sources that are openly available on the internet. We make use of the following datasets for training: [Segment-Anything-1B](https://ai.meta.com/datasets/segment-anything/), [JourneyDB](https://journeydb.github.io/), [DiffusionDB](https://github.com/poloclub/diffusiondb), and [DataComp](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), whose prompts are used to produce the synthetic data.
* **Training cost**: The Nitro-E-512px model requires only 1.5 days of training from scratch on a single node with 8 AMD Instinct™ MI300X GPUs.
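To make the ASA idea more concrete, here is a minimal, illustrative sketch of attention restricted to non-overlapping subregions of a square token grid, with the partition shifted in alternating blocks. This is not the E-MMDiT implementation: the grid size, window size, and shifting scheme below are assumptions chosen for illustration; see the [open-source code](https://github.com/AMD-AGI/Nitro-E) for the actual module.

```python
# Illustrative sketch of subregion attention (NOT the official E-MMDiT code).
# Assumptions: a square token grid, a window (subregion) size that divides the
# grid, and alternation between two partition offsets across blocks.
import torch
import torch.nn.functional as F


def subregion_attention(q, k, v, grid, window, shift=0):
    """Attention restricted to non-overlapping windows of a square token grid.

    q, k, v: (batch, heads, grid*grid, dim) tensors.
    grid:    side length of the square token grid.
    window:  side length of each subregion; must divide grid.
    shift:   cyclic shift applied before windowing, so alternating blocks
             see different subregion boundaries.
    """
    b, h, n, d = q.shape
    assert n == grid * grid and grid % window == 0

    def to_windows(x):
        x = x.view(b, h, grid, grid, d)
        if shift:
            x = torch.roll(x, shifts=(-shift, -shift), dims=(2, 3))
        # split rows/cols into (window index, position inside window)
        x = x.view(b, h, grid // window, window, grid // window, window, d)
        x = x.permute(0, 1, 2, 4, 3, 5, 6)
        # (b, h, num_windows, window*window, d)
        return x.reshape(b, h, (grid // window) ** 2, window * window, d)

    def from_windows(x):
        x = x.view(b, h, grid // window, grid // window, window, window, d)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, h, grid, grid, d)
        if shift:
            x = torch.roll(x, shifts=(shift, shift), dims=(2, 3))
        return x.reshape(b, h, n, d)

    # attention is computed independently within each window
    out = F.scaled_dot_product_attention(to_windows(q), to_windows(k), to_windows(v))
    return from_windows(out)


# Alternate the partition across transformer blocks: e.g. even-indexed blocks
# use shift=0 and odd-indexed blocks use a half-window shift.
q = k = v = torch.randn(2, 8, 16 * 16, 64)
out_even = subregion_attention(q, k, v, grid=16, window=4, shift=0)
out_odd = subregion_attention(q, k, v, grid=16, window=4, shift=2)
print(out_even.shape, out_odd.shape)  # torch.Size([2, 8, 256, 64]) for both
```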
## Quickstart

* **Image generation with 20 steps**:

```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
repo_name = "amd/Nitro-E"

resolution = 512
ckpt_name = 'Nitro-E-512px.safetensors'

# for 1024px model
# resolution = 1024
# ckpt_name = 'Nitro-E-1024px.safetensors'

use_grpo = True

if use_grpo:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name, ckpt_path_grpo='ckpt_grpo_512px')
else:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)

prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=20, guidance_scale=4.5).images
```
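As in standard diffusers pipelines, `.images` is a list of images; assuming `init_pipe` follows the usual convention of returning PIL images, they can be saved directly (the file name below is just an example). Continuing from the snippet above:

```python
# Save the generated image(s) to disk; `images` is a list of PIL.Image objects.
for i, image in enumerate(images):
    image.save(f"nitro_e_512px_{i}.png")
```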
* **Image generation with 4 steps**:

```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px-dist.safetensors'

pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'

images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0).images
```
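For a rough feel of inference speed on your own hardware, you can time repeated calls to the distilled pipeline as sketched below. This measures single-prompt latency, so the numbers will not match the batch-32 MI300X throughput quoted in the introduction. Continuing from the snippet above:

```python
import time

# Warm-up call so one-time initialization does not skew the measurement.
pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0)

n_runs = 10
torch.cuda.synchronize()
start = time.time()
for _ in range(n_runs):
    pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"~{n_runs / elapsed:.2f} images/s with a single prompt per call")
```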
## License

Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.

This project is licensed under the [MIT License](https://mit-license.org/).