# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

DreamOmni2 is a multimodal image generation and editing system that combines a Vision Language Model (VLM) with a diffusion pipeline. The system supports two primary modes:

- **Image Editing**: Modify existing images using instructions and reference images
- **Image Generation**: Create new images from multiple reference images and instructions

## Architecture

### Core Pipeline Flow

1. **VLM Processing** (Qwen2.5-VL): Converts user instructions + input images into detailed text prompts
2. **Diffusion Pipeline** (FLUX.1-Kontext): Generates/edits images based on the VLM-generated prompts
3. **LoRA Adapters**: Separate adapters for editing and generation tasks

### Key Components

- `dreamomni2/pipeline_dreamomni2.py`: Custom DreamOmni2Pipeline extending DiffusionPipeline
  - Handles multi-image conditioning via image latents and IDs
  - Supports FLUX.1-Kontext preferred resolutions (17 predefined aspect ratios)
  - Image latents are packed alongside text latents for joint attention
- `utils/vprocess.py`: Vision processing utilities
  - `process_vision_info()`: Extracts and processes images/videos from the message format
  - `resizeinput()`: Resizes images to PREFERRED_KONTEXT_RESOLUTIONS
  - Handles various image input formats (local path, URL, base64, PIL.Image)
- `inference_edit.py`: CLI script for image editing tasks
  - Loads the edit_lora adapter weights
  - Appends the " It is editing task." suffix to instructions
- `inference_gen.py`: CLI script for image generation tasks
  - Loads the gen_lora adapter weights
  - Appends the " It is generation task." suffix to instructions
- `app.py`: Gradio web interface for Hugging Face Spaces deployment
  - Uses `@spaces.GPU(duration=90)` decorators for GPU allocation
  - Combines editing and generation in a single interface

## Running the Code

### Image Editing

```bash
python inference_edit.py \
    --vlm_path ./models/vlm-model \
    --edit_lora_path ./models/edit_lora \
    --base_model_path black-forest-labs/FLUX.1-Kontext-dev \
    --input_img_path src.jpg ref.jpg \
    --input_instruction "Your editing instruction here" \
    --output_path output.png
```

### Image Generation

```bash
python inference_gen.py \
    --vlm_path ./models/vlm-model \
    --gen_lora_path ./models/gen_lora \
    --base_model_path black-forest-labs/FLUX.1-Kontext-dev \
    --input_img_path img1.jpg img2.jpg \
    --input_instruction "Your generation instruction here" \
    --height 1024 \
    --width 1024 \
    --output_path output.png
```

### Gradio Web Interface

```bash
python app.py
```

On Hugging Face Spaces, the app launches automatically from `app.py`.

## Important Technical Details

### Image Resolution Handling

- All images are automatically resized to one of the 17 PREFERRED_KONTEXT_RESOLUTIONS
- Aspect ratio is preserved while keeping dimensions divisible by 16
- Resolutions range from 672x1568 to 1568x672, with 1024x1024 for square images

### VLM Prompt Format

The VLM expects messages with:

- One or more images as `{"type": "image", "image": path}`
- Instruction text with a task-specific suffix: `{"type": "text", "text": instruction + suffix}`
- The suffix is " It is editing task." or " It is generation task."

The VLM output is wrapped in opening and closing tags and extracted via `extract_gen_content()`, which strips them by removing the first 6 and last 7 characters.

### Multi-Image Conditioning

The pipeline handles multiple input images in four steps (a code sketch follows the list):

1. Encoding each image to latents via the VAE
2. Packing latents into 2x2 patches
3. Creating image IDs with a unique index per image and offset positions
4. Concatenating image latents with noise latents during denoising
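For orientation, here is a minimal sketch of steps 2-3 in the style of diffusers' FLUX pipelines. The names `pack_latents` and `make_image_ids` are illustrative, not the repository's actual functions, and DreamOmni2Pipeline's exact index/offset convention may differ:

```python
import torch

def pack_latents(latents: torch.Tensor) -> torch.Tensor:
    """Fold each 2x2 spatial patch into the channel dim:
    (B, C, H, W) -> (B, (H//2)*(W//2), C*4)."""
    b, c, h, w = latents.shape
    latents = latents.view(b, c, h // 2, 2, w // 2, 2)
    latents = latents.permute(0, 2, 4, 1, 3, 5)
    return latents.reshape(b, (h // 2) * (w // 2), c * 4)

def make_image_ids(h: int, w: int, image_index: int) -> torch.Tensor:
    """One (index, row, col) triple per packed patch. The first channel
    carries a unique index per image so reference tokens are
    distinguishable from the noise latents (index 0) in joint attention."""
    ids = torch.zeros(h // 2, w // 2, 3)
    ids[..., 0] = image_index                     # unique index per image
    ids[..., 1] += torch.arange(h // 2)[:, None]  # row position
    ids[..., 2] += torch.arange(w // 2)[None, :]  # col position
    return ids.reshape(-1, 3)

# Example: two reference images conditioning the noise latents
noise = torch.randn(1, 16, 64, 64)  # 1024x1024 image -> 64x64 latent grid
refs = [torch.randn(1, 16, 64, 64) for _ in range(2)]

packed = [pack_latents(noise)] + [pack_latents(r) for r in refs]
ids = [make_image_ids(64, 64, 0)] + [make_image_ids(64, 64, i + 1) for i in range(len(refs))]

latent_seq = torch.cat(packed, dim=1)  # (1, 3*1024, 64), fed through denoising
id_seq = torch.cat(ids, dim=0)         # (3*1024, 3), positional IDs for RoPE
```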
### NPU Support

The codebase includes optional NPU (Neural Processing Unit) support:

- Detects `torch_npu` availability at runtime
- Applies NPU-specific patches if available
- Falls back to CUDA if NPU is unavailable

## Model Dependencies

- Base model: `black-forest-labs/FLUX.1-Kontext-dev`
- VLM: Qwen2.5-VL (from `xiabs/DreamOmni2/vlm-model`)
- LoRA weights: edit_lora and gen_lora (from `xiabs/DreamOmni2`)
- All models use bfloat16 precision

## Device Configuration

The code automatically selects a device (see the sketch below):

- NPU if `torch_npu` is available
- CUDA otherwise
- The VLM is loaded with a hardcoded `device_map="cuda"` in the inference scripts
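A minimal sketch of this detection pattern, assuming the repository's behavior as described above (the function name `select_device` is illustrative, and the actual NPU patching logic is more involved):

```python
import torch

def select_device() -> str:
    """Prefer NPU when torch_npu is importable; otherwise fall back to CUDA."""
    try:
        import torch_npu  # noqa: F401  # optional Ascend backend
        # NPU-specific patches would be applied here in the real code
        return "npu"
    except ImportError:
        return "cuda"

device = select_device()
model_dtype = torch.bfloat16  # all models load in bfloat16
```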