Depth Anything 3: DA3NESTED-GIANT-LARGE
Model Description
DA3 Nested model that combines the any-view Giant model with the metric Large model for metric-scale visual geometry reconstruction. This is our recommended model: it supports every capability listed below.
| Property | Value |
|---|---|
| Model Series | Nested |
| Parameters | 1.40B |
| License | CC BY-NC 4.0 |
⚠️ Non-commercial use only due to CC BY-NC 4.0 license.
Capabilities
- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians
- ✅ Metric Depth
- ✅ Sky Segmentation
Quick Start
Installation
pip install depth-anything-3
Basic Example
import torch
from depth_anything_3.api import DepthAnything3
# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3nested-giant-large")
model = model.to(device=device)
# Run inference on images
images = ["image1.jpg", "image2.jpg"] # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)
# Access results
print(prediction.depth.shape) # Depth maps: [N, H, W] float32
print(prediction.conf.shape) # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape) # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape) # Camera intrinsics: [N, 3, 3] float32
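The shapes above are enough for some common post-processing. The following is a minimal sketch, not part of the `depth_anything_3` API: it inverts the world-to-camera poses into camera-to-world matrices to read off camera centers, and masks out low-confidence depth pixels. The helper names `w2c_to_c2w` and `mask_low_confidence`, the threshold value, and the synthetic stand-in arrays are all assumptions made for illustration. It also assumes the prediction fields behave like NumPy arrays; convert them first if they are tensors.

```python
import numpy as np


def w2c_to_c2w(extrinsics):
    """Invert [N, 3, 4] world-to-camera matrices into [N, 4, 4] camera-to-world."""
    n = extrinsics.shape[0]
    rot = extrinsics[:, :, :3]                 # [N, 3, 3] rotation part
    trans = extrinsics[:, :, 3:]               # [N, 3, 1] translation part
    rot_inv = np.transpose(rot, (0, 2, 1))     # R^T, the inverse of a rotation matrix
    c2w = np.tile(np.eye(4, dtype=extrinsics.dtype), (n, 1, 1))
    c2w[:, :3, :3] = rot_inv
    c2w[:, :3, 3:] = -rot_inv @ trans          # camera center in world coordinates
    return c2w


def mask_low_confidence(depth, conf, threshold):
    """Return a copy of depth where pixels with conf < threshold become NaN."""
    return np.where(conf >= threshold, depth, np.nan)


# Synthetic stand-ins with the documented shapes (hypothetical values).
theta = np.pi / 2
rot_z = np.array(
    [[np.cos(theta), -np.sin(theta), 0.0],
     [np.sin(theta), np.cos(theta), 0.0],
     [0.0, 0.0, 1.0]],
    dtype=np.float32,
)
extrinsics = np.zeros((2, 3, 4), dtype=np.float32)
extrinsics[0, :, :3] = np.eye(3)               # view 0: identity pose
extrinsics[1, :, :3] = rot_z                   # view 1: rotated and translated
extrinsics[1, :, 3] = [1.0, 2.0, 3.0]

c2w = w2c_to_c2w(extrinsics)
camera_centers = c2w[:, :3, 3]                 # [N, 3]

depth = np.full((2, 4, 4), 2.0, dtype=np.float32)
conf = np.ones((2, 4, 4), dtype=np.float32)
conf[0, 0, 0] = 0.1                            # one unreliable pixel
filtered = mask_low_confidence(depth, conf, threshold=0.5)
```

In practice, `extrinsics`, `depth`, and `conf` would come from `prediction.extrinsics`, `prediction.depth`, and `prediction.conf`. The scale of the confidence values is not stated here, so a suitable threshold has to be chosen by inspecting the data.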
Command Line Interface
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3nested-giant-large
# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3nested-giant-large
da3 auto path/to/images --export-format glb --use-backend
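To call the CLI from a script, the commands above can be assembled as an argument list and passed to `subprocess`. This is a minimal sketch that uses only the subcommand and flags shown in this section; the helper name `build_da3_auto_command` and its parameters are hypothetical and not part of the package.

```python
import shlex
import subprocess


def build_da3_auto_command(image_dir, export_format="glb", export_dir=None,
                           model_dir=None, use_backend=False):
    """Assemble a `da3 auto` invocation from the flags shown in this section."""
    cmd = ["da3", "auto", str(image_dir), "--export-format", export_format]
    if export_dir is not None:
        cmd += ["--export-dir", str(export_dir)]
    if model_dir is not None:
        cmd += ["--model-dir", model_dir]
    if use_backend:
        cmd.append("--use-backend")
    return cmd


def run_da3_auto(**kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    cmd = build_da3_auto_command(**kwargs)
    subprocess.run(cmd, check=True)


# Mirror the two invocations from the examples above.
direct_cmd = build_da3_auto_command(
    "path/to/images",
    export_format="glb",
    export_dir="output",
    model_dir="depth-anything/da3nested-giant-large",
)
backend_cmd = build_da3_auto_command("path/to/images", use_backend=True)

print(shlex.join(direct_cmd))
print(shlex.join(backend_cmd))
```

Passing a list rather than a shell string avoids quoting problems when the image directory contains spaces. `run_da3_auto` is defined but not called here, since it requires the `da3` executable to be installed.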
Model Details
- Developed by: ByteDance Seed Team
- Model Type: Vision Transformer for Visual Geometry
- Architecture: Plain transformer with unified depth-ray representation
- Training Data: Public academic datasets only
Key Insights
A single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.
A singular depth-ray representation obviates the need for complex multi-task learning.
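To illustrate why a depth value plus a ray per pixel is enough to define 3D geometry, the sketch below builds per-pixel rays from a pinhole camera and recovers world-space points as origin + depth × direction. This is an illustration under stated assumptions, not the model's internal parameterization: it assumes depth is measured along the optical axis (z-depth), ray directions are left unnormalized so that their camera-frame z-component is 1, pixel coordinates are integer indices with no half-pixel offset, and poses follow the world-to-camera convention. The function names are hypothetical.

```python
import numpy as np


def pixel_rays(intrinsics, w2c, height, width):
    """Per-pixel ray origins and directions in world coordinates for one view.

    intrinsics: [3, 3] pinhole matrix K; w2c: [3, 4] world-to-camera [R | t].
    """
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)    # [H, W, 3] homogeneous
    dirs_cam = pixels @ np.linalg.inv(intrinsics).T        # K^-1 [u, v, 1], z == 1
    rot, trans = w2c[:, :3], w2c[:, 3]
    dirs_world = dirs_cam @ rot                            # row-vector form of R^T d
    origin = -rot.T @ trans                                # camera center in world
    origins = np.broadcast_to(origin, dirs_world.shape)
    return origins, dirs_world


def depth_rays_to_points(depth, origins, directions):
    """Combine a [H, W] depth map with per-pixel rays into [H, W, 3] points."""
    return origins + depth[..., None] * directions


# Hypothetical camera: focal length 100 px, principal point at pixel (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
w2c = np.hstack([np.eye(3), np.array([[0.0], [0.0], [-1.0]])])  # center at z = 1
depth = np.full((3, 5), 2.0)

origins, directions = pixel_rays(K, w2c, height=3, width=5)
points = depth_rays_to_points(depth, origins, directions)
```

With these conventions, the pixel at the principal point looks straight down the optical axis, so its point lies 2 units in front of the camera center.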
Performance
Depth Anything 3 significantly outperforms:
- Depth Anything 2 for monocular depth estimation
- VGGT for multi-view depth estimation and pose estimation
For detailed benchmarks, please refer to our paper and Visual Geometry Benchmark.
Limitations
- The model is trained only on academic datasets, so accuracy may degrade on domain-specific images that those datasets do not cover
- Performance may vary with image quality, lighting conditions, and scene complexity
- ⚠️ Non-commercial use only due to CC BY-NC 4.0 license.
Citation
If you find Depth Anything 3 useful in your research or projects, please cite:
@article{depthanything3,
title={Depth Anything 3: Recovering the visual space from any views},
author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2025}
}
Links
- Project Page
- Paper
- GitHub Repository
- Hugging Face Demo
- Visual Geometry Benchmark
- Documentation
Authors
Haotong Lin · Sili Chen · Junhao Liew · Donny Y. Chen · Zhenyu Li · Guang Shi · Jiashi Feng · Bingyi Kang