Depth Anything 3: DA3NESTED-GIANT-LARGE


Model Description

The DA3 Nested model combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. It is our recommended model, as it covers the full set of capabilities listed below.

| Property | Value |
|----------|-------|
| Model Series | Nested |
| Parameters | 1.40B |
| License | CC BY-NC 4.0 |

⚠️ Non-commercial use only due to CC BY-NC 4.0 license.

Capabilities

  • ✅ Relative Depth
  • ✅ Pose Estimation
  • ✅ Pose Conditioning
  • ✅ 3D Gaussians
  • ✅ Metric Depth
  • ✅ Sky Segmentation

Quick Start

Installation

pip install depth-anything-3

Basic Example

import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb"  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)        # Depth maps: [N, H, W] float32
print(prediction.conf.shape)         # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)   # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)   # Camera intrinsics: [N, 3, 3] float32
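
The returned depth, intrinsics, and extrinsics are enough to lift each view into a world-space point cloud. Below is a minimal back-projection sketch; it assumes the prediction fields convert cleanly to NumPy arrays (move them off the GPU first if they are tensors) and follows the documented world-to-camera [R | t] convention for extrinsics.

import numpy as np

depth = np.asarray(prediction.depth[0])          # [H, W] depth for the first view
K = np.asarray(prediction.intrinsics[0])         # [3, 3] camera intrinsics
w2c = np.asarray(prediction.extrinsics[0])       # [3, 4] world-to-camera [R | t]

H, W = depth.shape
u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grid (x right, y down)
pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)

# Unproject to camera space: X_cam = depth * K^-1 [u, v, 1]^T
cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

# Camera space -> world space: X_world = R^T (X_cam - t)
R, t = w2c[:, :3], w2c[:, 3]
world = (cam - t) @ R

print(world.shape)                               # (H*W, 3) world-space points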

Command Line Interface

# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3nested-giant-large

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend
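
For scripted batch runs, the same inference call from the Basic Example can be looped over several scene folders instead of invoking the CLI per scene. A minimal sketch; the folder layout and *.jpg glob pattern are illustrative assumptions, and `model` is the object loaded above.

from pathlib import Path

# Scripted alternative to the CLI: reuse the loaded `model` and export one GLB
# per scene folder. Directory structure here is an assumption for illustration.
scenes_root = Path("path/to/scenes")
for scene_dir in sorted(p for p in scenes_root.iterdir() if p.is_dir()):
    images = sorted(str(p) for p in scene_dir.glob("*.jpg"))
    if not images:
        continue
    model.inference(
        images,
        export_dir=f"output/{scene_dir.name}",
        export_format="glb",
    )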

Model Details

  • Developed by: ByteDance Seed Team
  • Model Type: Vision Transformer for Visual Geometry
  • Architecture: Plain transformer with unified depth-ray representation
  • Training Data: Public academic datasets only

Key Insights

💎 A single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.

✨ A singular depth-ray representation obviates the need for complex multi-task learning.

Performance

🏆 Depth Anything 3 significantly outperforms:

  • Depth Anything 2 for monocular depth estimation
  • VGGT for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our paper and Visual Geometry Benchmark.

Limitations

  • The model is trained only on public academic datasets and may underperform on domain-specific images outside that distribution
  • Performance may vary depending on image quality, lighting conditions, and scene complexity
  • ⚠️ Non-commercial use only due to CC BY-NC 4.0 license.

Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Authors

Haotong Lin · Sili Chen · Jun Hao Liew · Donny Y. Chen · Zhenyu Li · Guang Shi · Jiashi Feng · Bingyi Kang
