---
license: cc-by-nc-4.0
tags:
  - depth-estimation
  - computer-vision
  - monocular-depth
  - multi-view-geometry
  - pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

Depth Anything 3: DA3NESTED-GIANT-LARGE

Project Page · Paper · Demo · Benchmark

Model Description

The DA3 Nested model combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. It is our recommended model, as it brings together all capabilities.

Property Value
Model Series Nested
Parameters 1.40B
License CC BY-NC 4.0

⚠️ Non-commercial use only due to the CC BY-NC 4.0 license.

Capabilities

  • ✅ Relative Depth
  • ✅ Pose Estimation
  • ✅ Pose Conditioning
  • ✅ 3D Gaussians
  • ✅ Metric Depth
  • ✅ Sky Segmentation

Quick Start

Installation

pip install depth-anything-3
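
To verify the installation and GPU visibility before running inference, a quick check such as the one below should suffice (it relies only on the imports used in the example that follows):

# Sanity check: the package imports and a CUDA device is visible.
import torch
import depth_anything_3  # top-level package installed by depth-anything-3

print("depth-anything-3 imported successfully")
print("CUDA available:", torch.cuda.is_available())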

Basic Example

import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb"  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)        # Depth maps: [N, H, W] float32
print(prediction.conf.shape)         # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)   # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)   # Camera intrinsics: [N, 3, 3] float32
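
The documented output shapes are enough to lift each view into a shared world frame. Below is a minimal back-projection sketch, not part of the official API: it assumes pinhole intrinsics, depth measured along the optical axis (z-depth), world-to-camera extrinsics as listed above, and NumPy inputs (move CUDA tensors to CPU first).

import numpy as np

def depth_to_world_points(depth, intrinsics, extrinsics):
    # depth: [H, W], intrinsics: [3, 3], extrinsics: [3, 4] world-to-camera (w2c)
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))      # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # [H, W, 3] homogeneous pixels
    rays = pix @ np.linalg.inv(intrinsics).T            # camera-frame rays (z = 1)
    pts_cam = rays * depth[..., None]                   # scale rays by z-depth
    R, t = extrinsics[:, :3], extrinsics[:, 3]          # w2c rotation and translation
    return (pts_cam.reshape(-1, 3) - t) @ R             # invert w2c: R^T (X - t), [H*W, 3]

# Example: point cloud for the first view (convert tensors to NumPy if needed)
pts = depth_to_world_points(
    np.asarray(prediction.depth[0]),
    np.asarray(prediction.intrinsics[0]),
    np.asarray(prediction.extrinsics[0]),
)
print(pts.shape)  # (H*W, 3)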

Command Line Interface

# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3nested-giant-large

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend

Model Details

  • Developed by: ByteDance Seed Team
  • Model Type: Vision Transformer for Visual Geometry
  • Architecture: Plain transformer with unified depth-ray representation
  • Training Data: Public academic datasets only

Key Insights

💎 A single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.

✨ A singular depth-ray representation obviates the need for complex multi-task learning.
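
As a toy illustration of the depth-ray idea (not the model's actual code, and with every array below invented for the example): each pixel carries a camera ray and a depth, and multiplying the two recovers 3D geometry directly.

import numpy as np

# Toy depth-ray sketch: per-pixel ray origins o and unit directions d,
# plus a per-pixel depth t, give 3D points as o + t * d.
H, W = 4, 6
origins = np.zeros((H, W, 3))                                     # shared camera center
directions = np.random.randn(H, W, 3)
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)  # unit ray directions
depth = np.random.rand(H, W)                                      # distance along each ray

points = origins + depth[..., None] * directions                  # [H, W, 3] scene geometry
print(points.shape)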

Performance

πŸ† Depth Anything 3 significantly outperforms:

  • Depth Anything 2 for monocular depth estimation
  • VGGT for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our paper and Visual Geometry Benchmark.

Limitations

  • The model is trained only on public academic datasets and may underperform on certain domain-specific images
  • Performance may vary depending on image quality, lighting conditions, and scene complexity
  • ⚠️ Non-commercial use only due to the CC BY-NC 4.0 license

Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}


Authors

Haotong Lin · Sili Chen · Jun Hao Liew · Donny Y. Chen · Zhenyu Li · Guang Shi · Jiashi Feng · Bingyi Kang