---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---
# AMD Nitro-E

## Introduction
Nitro-E is a family of text-to-image diffusion models focused on highly efficient training. With just 304M parameters, Nitro-E is designed to be resource-friendly for both training and inference. Training from scratch takes only 1.5 days on a single node with 8 AMD Instinct™ MI300X GPUs. On the inference side, Nitro-E delivers a throughput of 18.8 samples per second (batch size 32, 512px images) on a single AMD Instinct MI300X GPU, and the distilled version raises this to 39.3 samples per second. The release consists of:

* [Nitro-E-512px](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px.safetensors): an EMMDiT-based 20-step model trained from scratch.
* [Nitro-E-512px-dist](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px-dist.safetensors): an EMMDiT-based model distilled from Nitro-E-512px.
* [Nitro-E-512px-GRPO](https://huggingface.co/amd/Nitro-E/tree/main/ckpt_grpo_512px): a model post-trained from Nitro-E-512px with the Group Relative Policy Optimization (GRPO) strategy.

⚡️ [Open-source code](https://github.com/AMD-AGI/Nitro-E)!

⚡️ [Technical blog](https://rocm.blogs.amd.com/artificial-intelligence/nitro-e/README.html)!
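If you only need a checkpoint file itself (for example, to load it with your own code), it can be fetched from the Hub with `huggingface_hub`. This is an optional sketch; the Quickstart below downloads everything through `init_pipe`, and the file name is simply one of the checkpoints listed above.

```python
# Optional: fetch a single checkpoint file from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="amd/Nitro-E", filename="Nitro-E-512px.safetensors")
print(ckpt_path)  # local path of the cached checkpoint
```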
## Details
* **Model architecture**: We propose the Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis under low training resources. Our design philosophy centers on token reduction, since computational cost scales significantly with token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module to compress tokens further. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost (a minimal sketch of the subregion-attention idea appears after this list). In addition, we propose AdaLN-affine, an efficient and lightweight module for computing modulation parameters in transformer blocks. See our technical blog post for more details.
* **Dataset**: Our models were trained on a dataset of ~25M images consisting of both real and synthetic data sources that are openly available on the internet. We make use of the following datasets for training: [Segment-Anything-1B](https://ai.meta.com/datasets/segment-anything/), [JourneyDB](https://journeydb.github.io/), [DiffusionDB](https://github.com/poloclub/diffusiondb), and [DataComp](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), whose prompts are used to produce the synthetic data.
* **Training cost**: The Nitro-E-512px model requires only 1.5 days of training from scratch on a single node with 8 AMD Instinct™ MI300X GPUs.
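To make the ASA idea more concrete, here is a minimal, illustrative sketch of attention restricted to non-overlapping subregions of a square token grid, with the partition shifted in alternating blocks. This is not the E-MMDiT implementation: the grid size, window size, and shifting scheme below are assumptions chosen for illustration; see the [open-source code](https://github.com/AMD-AGI/Nitro-E) for the actual module.

```python
# Illustrative sketch of subregion attention (NOT the official E-MMDiT code).
# Assumptions: a square token grid, a window (subregion) size that divides the
# grid, and alternation between two partition offsets across blocks.
import torch
import torch.nn.functional as F


def subregion_attention(q, k, v, grid, window, shift=0):
    """Attention restricted to non-overlapping windows of a square token grid.

    q, k, v: (batch, heads, grid*grid, dim) tensors.
    grid:    side length of the square token grid.
    window:  side length of each subregion; must divide grid.
    shift:   cyclic shift applied before windowing, so alternating blocks
             see different subregion boundaries.
    """
    b, h, n, d = q.shape
    assert n == grid * grid and grid % window == 0

    def to_windows(x):
        x = x.view(b, h, grid, grid, d)
        if shift:
            x = torch.roll(x, shifts=(-shift, -shift), dims=(2, 3))
        # split rows/cols into (window index, position inside window)
        x = x.view(b, h, grid // window, window, grid // window, window, d)
        x = x.permute(0, 1, 2, 4, 3, 5, 6)
        # (b, h, num_windows, window*window, d)
        return x.reshape(b, h, (grid // window) ** 2, window * window, d)

    def from_windows(x):
        x = x.view(b, h, grid // window, grid // window, window, window, d)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, h, grid, grid, d)
        if shift:
            x = torch.roll(x, shifts=(shift, shift), dims=(2, 3))
        return x.reshape(b, h, n, d)

    # attention is computed independently within each window
    out = F.scaled_dot_product_attention(to_windows(q), to_windows(k), to_windows(v))
    return from_windows(out)


# Alternate the partition across transformer blocks: e.g. even-indexed blocks
# use shift=0 and odd-indexed blocks use a half-window shift.
q = k = v = torch.randn(2, 8, 16 * 16, 64)
out_even = subregion_attention(q, k, v, grid=16, window=4, shift=0)
out_odd = subregion_attention(q, k, v, grid=16, window=4, shift=2)
print(out_even.shape, out_odd.shape)  # torch.Size([2, 8, 256, 64]) for both
```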
## Quickstart

* **Image generation with 20 steps**:

```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
repo_name = "amd/Nitro-E"

resolution = 512
ckpt_name = 'Nitro-E-512px.safetensors'

# for 1024px model
# resolution = 1024
# ckpt_name = 'Nitro-E-1024px.safetensors'

use_grpo = True

if use_grpo:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name, ckpt_path_grpo='ckpt_grpo_512px')
else:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)

prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=20, guidance_scale=4.5).images
```
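As in standard diffusers pipelines, `.images` is a list of images; assuming `init_pipe` follows the usual convention of returning PIL images, they can be saved directly (the file name below is just an example). Continuing from the snippet above:

```python
# Save the generated image(s) to disk; `images` is a list of PIL.Image objects.
for i, image in enumerate(images):
    image.save(f"nitro_e_512px_{i}.png")
```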
* **Image generation with 4 steps**:

```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px-dist.safetensors'

pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'

images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0).images
```
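For a rough feel of inference speed on your own hardware, you can time repeated calls to the distilled pipeline as sketched below. This measures single-prompt latency, so the numbers will not match the batch-32 MI300X throughput quoted in the introduction. Continuing from the snippet above:

```python
import time

# Warm-up call so one-time initialization does not skew the measurement.
pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0)

n_runs = 10
torch.cuda.synchronize()
start = time.time()
for _ in range(n_runs):
    pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"~{n_runs / elapsed:.2f} images/s with a single prompt per call")
```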
## License

Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.

This project is licensed under the [MIT License](https://mit-license.org/).