Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Thinking with Camera

Paper

This model was presented in the paper Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation.

Abstract

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.

Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns camera-centric understanding and generation tasks within a unified multimodal framework. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.

Developed by: Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Affiliation: S-Lab, Nanyang Technological University
First released: arXiv pre-print, 2025
Model type: Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation)
Modality: Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text
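
To make the "thinking with camera" idea more concrete, the snippet below sketches one plausible way global camera parameters could be serialized into text for a vision-language model. It is purely illustrative: the tag format and the helper name camera_to_text are assumptions, not Puffin's actual tokenization.

# Hypothetical "camera as language" serialization of global camera parameters
# (roll, pitch, vertical FoV, all in radians) into a text span a VLM can reason over.
def camera_to_text(roll: float, pitch: float, vfov: float) -> str:
    return f"<camera> roll={roll:.4f} pitch={pitch:.4f} vfov={vfov:.4f} </camera>"

print(camera_to_text(-0.3939, 0.0277, 0.7595))
# <camera> roll=-0.3939 pitch=0.0277 vfov=0.7595 </camera>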

Direct Use

  • Camera-centric understanding and generation from a single image or a text–camera pair; supports the thinking mode.
  • World exploration: performs cross-view generation from a given initial view and a target camera configuration.
  • Spatial imagination: imagines the scene description based on an initial view and a target camera configuration.
  • 3D virtual object insertion in AR/VR: assists the insertion of virtual 3D objects into in-the-wild images by calibrating camera parameters (see the intrinsics sketch after this list).
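
For the AR/VR insertion use case, the calibrated roll, pitch, and vertical field-of-view can be turned into a standard pinhole intrinsic matrix. The sketch below is not part of the Puffin codebase; the helper name intrinsics_from_vfov and the image size are assumptions, and it presumes square pixels with the principal point at the image center.

import math
import numpy as np

def intrinsics_from_vfov(vfov_rad: float, width: int, height: int) -> np.ndarray:
    """Pinhole intrinsics from a vertical field-of-view given in radians."""
    fy = (height / 2.0) / math.tan(vfov_rad / 2.0)  # focal length in pixels
    return np.array([[fy, 0.0, width / 2.0],
                     [0.0, fy, height / 2.0],
                     [0.0, 0.0, 1.0]])

# e.g. the vertical FoV used in the generation example below (0.7595 rad ≈ 43.5°)
print(intrinsics_from_vfov(0.7595, width=1024, height=1024))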

Sample Usage

This section demonstrates how to generate images with camera control using Puffin-Base, based on the examples provided in the GitHub repository.

First, download the model checkpoints from 🤗 KangLiao/Puffin and organize them in a checkpoints directory, for example:

Puffin/
└── checkpoints
    ├── Puffin-Align.pth # provided for customized SFT
    ├── Puffin-Base.pth
    ├── Puffin-Thinking.pth
    └── Puffin-Instruct.pth

You can use huggingface-cli to download the checkpoints:

# pip install -U "huggingface_hub[cli]"
huggingface-cli download KangLiao/Puffin --local-dir checkpoints --repo-type model
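
Alternatively, the same download can be done from Python with huggingface_hub's snapshot_download; the target directory below simply mirrors the layout shown above:

# pip install -U huggingface_hub
from huggingface_hub import snapshot_download

# Fetch all Puffin checkpoints into ./checkpoints
snapshot_download(repo_id="KangLiao/Puffin", repo_type="model", local_dir="checkpoints")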

To run the camera-controllable image generation:

export PYTHONPATH=./:$PYTHONPATH
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
          --checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595

This command generates an image based on the provided text prompt and camera parameters (roll: -r, pitch: -p, vertical field-of-view: -f, all in radians). The output image will be saved as generation_result.jpg.
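
Note that the script expects these values in radians. If your camera is specified in degrees, a quick conversion like the following (a small helper sketch, not part of the repository) reproduces the flags used above:

import math

# Roughly the camera from the example above, expressed in degrees.
roll_deg, pitch_deg, vfov_deg = -22.569, 1.587, 43.516
roll, pitch, vfov = (math.radians(d) for d in (roll_deg, pitch_deg, vfov_deg))
print(f"-r {roll:.4f} -p {pitch:.4f} -f {vfov:.4f}")
# -> -r -0.3939 -p 0.0277 -f 0.7595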

To enable the thinking mode for image generation, switch to the thinking config and checkpoint and append the --thinking flag:

python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
          --checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595 \
          --thinking

Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

  @article{liao2025puffin,
    title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
    author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
    journal={arXiv preprint arXiv:2510.08673},
    year={2025}
  }

License

This project is licensed under the NTU S-Lab License 1.0.
