
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

International Conference on Computer Vision, ICCV 2025.


Xinyao Liao1,2, Xianfang Zeng2, Liao Wang2, Gang Yu2*, Guosheng Lin1*, Chi Zhang3

1 Nanyang Technological University  2 StepFun  3 Westlake University

🧩 Overview

Pipeline of Motion Field Agent

MotionAgent is a novel framework that enables fine-grained motion control for text-guided image-to-video generation. At its core is a motion field agent that parses motion information in text prompts and converts it into explicit object trajectories and camera extrinsics. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent’s previous actions.
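The paragraph above maps to a simple control flow. Below is a minimal Python sketch of that flow; every name in it (agent.parse_motion, to_unified_flow, and so on) is an illustrative placeholder, not the project's actual API:

def generate_video(image, prompt, agent, i2v_model, rethink_steps=0):
    # 1. The motion field agent parses motion cues in the text prompt into
    #    explicit object trajectories and camera extrinsics.
    trajectories, extrinsics = agent.parse_motion(image, prompt)

    # 2. Both motion representations are analytically combined into one
    #    dense optical-flow field.
    flow = to_unified_flow(trajectories, extrinsics, image)

    # 3. The unified flow conditions the diffusion-based image-to-video model.
    video = i2v_model.sample(image, condition_flow=flow)

    # 4. Optional rethinking: the agent inspects the result and iteratively
    #    corrects its earlier trajectory/extrinsic estimates.
    for _ in range(rethink_steps):
        trajectories, extrinsics = agent.rethink(video, trajectories, extrinsics)
        flow = to_unified_flow(trajectories, extrinsics, image)
        video = i2v_model.sample(image, condition_flow=flow)

    return video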

🎥 Demo

MotionAgent Demo Video
Click the image above to watch the full video on YouTube 🎬

🛠️ Dependencies and Installation

Follow the steps below to set up MotionAgent and run the demo smoothly 💫

🔹 1. Clone the Repository

Clone the official GitHub repository and enter the project directory:

git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent

🔹 2. Environment Setup

# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent

# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

# Install project dependencies
pip install -r requirements.txt
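Before moving on, you can verify that the pinned PyTorch build sees your GPU. A minimal Python check using only standard PyTorch calls:

# Quick environment check for the install above.
import torch

print(torch.__version__)          # expected: 2.4.1+cu124
print(torch.cuda.is_available())  # should be True on a CUDA 12.4 machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))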

🔹 3. Install Grounded-Segment-Anything Dependencies

MotionAgent relies on external segmentation and grounding models. Follow the steps below to install Grounded-Segment-Anything:

# Navigate to models directory
cd models

# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git

# Enter the cloned directory
cd Grounded-Segment-Anything

# Install Segment Anything
python -m pip install -e segment_anything

# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO
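To confirm both editable installs succeeded, a quick Python import check (these are the module names the two repositories install):

# Sanity check: both packages should import after the editable installs above.
import groundingdino
import segment_anything

print("Grounded-Segment-Anything dependencies OK")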

🔹 4. Install Metric3D Dependencies

MotionAgent relies on an external monocular depth estimation model. Follow the steps below to install Metric3D:

# From the repository root, navigate to the models directory
cd models

# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
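Note that Metric3D is consumed as a plain source checkout rather than a pip package. If you need to import it from your own scripts, a minimal sketch, assuming the clone lives at models/Metric3D as above (the repo's internal module layout may differ):

# Make the Metric3D source checkout importable from elsewhere in the project.
import sys
from pathlib import Path

sys.path.insert(0, str(Path("models/Metric3D").resolve()))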

🧱 Download Models

To run MotionAgent, please download all pretrained and auxiliary models listed below, and organize them under the ckpts/ directory as shown in the example structure.

1️⃣ Optical Flow ControlNet Weights

Download from 👉 Hugging Face (MotionAgent) and place the files in ckpts.

2️⃣ Stable Video Diffusion

Download from 👉 Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1) and save the model to ckpts.
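If you prefer scripting Hub downloads, huggingface_hub.snapshot_download can fetch a repository straight into ckpts/. A minimal sketch; REPO_ID is a placeholder for the repository linked above:

# Scripted alternative to downloading via the Hub web UI.
# REPO_ID is a placeholder -- substitute the repository linked above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="REPO_ID",
    local_dir="ckpts/stable-video-diffusion-img2vid-xt-1-1",
)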

3️⃣ Grounding DINO

Download the grounding model checkpoint using the command below:

wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

Then place it directly under ckpts.

4️⃣ Segment Anything

Download the segmentation model using:

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Then place it under ckpts.

5️⃣ Metric Depth Estimator

Download from 👉 Hugging Face (Metric3D) and place the files in ckpts.

6️⃣ CMP

Download from πŸ‘‰ Hugging Face (MOFA-Video-Hybrid/cmp) and save the model to models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints.

After all downloads and installations, your ckpts folder should look like this:

ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth
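Before launching the demo, a short Python check can confirm the layout (paths taken from the tree above and from step 6️⃣):

# Verify that all downloaded weights are where MotionAgent expects them.
from pathlib import Path

expected = [
    "ckpts/controlnet",
    "ckpts/stable-video-diffusion-img2vid-xt-1-1",
    "ckpts/groundingdino_swint_ogc.pth",
    "ckpts/metric_depth_vit_small_800k.pth",
    "ckpts/sam_vit_h_4b8939.pth",
    "models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints",
]
missing = [p for p in expected if not Path(p).exists()]
print("All checkpoints in place" if not missing else f"Missing: {missing}")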

🚀 Running the Demos

python run_agent.py

🔗 BibTeX

If you find MotionAgent useful for your research and applications, please cite using this BibTeX:

@article{liao2025motionagent,
  title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
  author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
  journal={arXiv preprint arXiv:2502.03207},
  year={2025}
}

πŸ™ Acknowledgements

We thank the following open-source projects for their excellent work:

- Grounded-Segment-Anything (Grounding DINO and Segment Anything)
- Metric3D
- MOFA-Video
- Stable Video Diffusion
- CMP
