# MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

International Conference on Computer Vision, ICCV 2025.

Xinyao Liao1,2, Xianfang Zeng2, Liao Wang2, Gang Yu2*, Guosheng Lin1*, Chi Zhang3

1 Nanyang Technological University  2 StepFun  3 Westlake University
## 🧩 Overview

*Pipeline of the Motion Field Agent*

MotionAgent is a novel framework that enables **fine-grained motion control** for text-guided image-to-video generation. At its core is a **motion field agent** that parses motion information in text prompts and converts it into explicit *object trajectories* and *camera extrinsics*. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent's previous actions.
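To make the "analytically integrated into a unified optical flow" step concrete, here is a minimal, hypothetical Python sketch. It assumes standard pinhole conventions (intrinsics `K`, extrinsics `(R, t)`, a per-pixel depth map, and a binary object mask); the function names and the rule that object motion overrides camera motion inside the mask are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch (not the official MotionAgent API): combining camera
# extrinsics and an object trajectory into one dense optical-flow field.
import numpy as np

def camera_flow(depth, K, R, t):
    """Per-pixel flow induced by a camera move (R, t), via depth-based reprojection."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)        # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                            # back-project to camera rays
    pts = rays * depth[..., None]                              # lift to 3D using depth
    moved = pts @ R.T + t                                      # apply camera extrinsics
    proj = moved @ K.T
    proj = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None) # re-project to pixels
    return proj - np.stack([xs, ys], axis=-1)                  # flow = new - old position

def unified_flow(depth, K, R, t, mask, displacement):
    """Unified field: object trajectory overrides camera-induced flow inside the mask."""
    flow = camera_flow(depth, K, R, t)
    flow[mask] = np.asarray(displacement, dtype=flow.dtype)    # (dx, dy) from the trajectory
    return flow
```

The resulting flow field is what conditions the diffusion-based image-to-video model described above.

## πŸŽ₯ Demo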

MotionAgent Demo Video
Click the image above to watch the full video on YouTube 🎬

## πŸ› οΈ Dependencies and Installation Follow the steps below to set up **MotionAgent** and run the demo smoothly πŸ’« ### πŸ”Ή 1. Clone the Repository Clone the official GitHub repository and enter the project directory: ```bash git clone https://github.com/leoisufa/MotionAgent.git cd MotionAgent ``` ### πŸ”Ή 2. Environment Setup ```bash # Create and activate conda environment conda create -n motionagent python==3.10 -y conda activate motionagent # Install PyTorch with CUDA 12.4 support pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124 # Install project dependencies pip install -r requirements.txt ``` ### πŸ”Ή 3. Install Grounded-Segment-Anything Dependencies MotionAgent relies on external segmentation and grounding models. Follow the steps below to install [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything): ```bash # Navigate to models directory cd models # Clone the Grounded-Segment-Anything repository git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git # Enter the cloned directory cd Grounded-Segment-Anything # Install Segment Anything python -m pip install -e segment_anything # Install Grounding DINO pip install --no-build-isolation -e GroundingDINO ``` ### πŸ”Ή 4. Install Metric3D Dependencies MotionAgent relies on an external monocular depth estimation model. Follow the steps below to install [Metric3D](https://github.com/YvanYin/Metric3D): ```bash # Navigate to models directory cd models # Clone the Grounded-Segment-Anything repository git clone https://github.com/YvanYin/Metric3D.git ``` ## 🧱 Download Models To run **MotionAgent**, please download all pretrained and auxiliary models listed below, and organize them under the `ckpts/` directory as shown in the example structure. ### 1️⃣ **Optical Flow ControlNet Weights** Download from πŸ‘‰ [Hugging Face (MotionAgent)](https://huggingface.co/leoisufa/MotionAgent) and place the files in `ckpts`. ### 2️⃣ **Stable Video Diffusion** Download from πŸ‘‰ [Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/tree/main/ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1) and save the model to `ckpts`. ### 3️⃣ **Grounding DINO** Download the grounding model checkpoint using the command below: ```bash wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth ``` Then place it directly under `ckpts`. ### 4️⃣ **Segment Anything** Download the segmentation model using: ```bash wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth ``` Then place it under `ckpts`. ### 5️⃣ **Metric Depth Estimator** Download from πŸ‘‰ [Hugging Face (Metric3d)](https://drive.google.com/file/d/1YfmvXwpWmhLg3jSxnhT7LvY0yawlXcr_/view?usp=drive_link) and place the files in `ckpts`. ### 6️⃣ **CMP** Download from πŸ‘‰ [Hugging Face (MOFA-Video-Hybrid/cmp)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/resolve/main/models/cmp/experiments/semiauto_annot/resnet50_vip%2Bmpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar) and save the model to `models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints`. 
After all downloads and installations, your `ckpts` folder should look like this:

```shell
ckpts/
β”œβ”€β”€ controlnet/
β”œβ”€β”€ stable-video-diffusion-img2vid-xt-1-1/
β”œβ”€β”€ groundingdino_swint_ogc.pth
β”œβ”€β”€ metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth
```

## πŸš€ Running the Demos

```bash
python run_agent.py
```

## πŸ”— BibTeX

If you find [MotionAgent](https://arxiv.org/abs/2502.03207) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@article{liao2025motionagent,
  title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
  author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
  journal={arXiv preprint arXiv:2502.03207},
  year={2025}
}
```

## πŸ™ Acknowledgements

We thank the following prior works for their excellent open-source contributions:

- [MOFA-Video](https://github.com/MyNiuuu/MOFA-Video)
- [AppAgent](https://github.com/TencentQQGYLab/AppAgent)
- [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Metric3D](https://github.com/YvanYin/Metric3D)