Depth Anything in 360°:
Towards Scale Invariance in the Wild

Insta360 Research

Point Cloud Visualization

[Interactive 3D point cloud viewer]

Comparison on SUN360 Outdoor Examples

The MoGe series estimates depth on 12 tangent-plane images and fuses them via post-optimization to produce the panoramic depth map, taking over 100 seconds. In stark contrast, the end-to-end methods require only 0.02 to 0.05 seconds.

Comparison on SUN360 Indoor Examples


Abstract

Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind that of perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50% and 10% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
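To make the invariance distinction concrete: a scale-and-shift-invariant (affine-invariant) prediction must be aligned to ground truth with both a scale and a shift before its geometry is meaningful, whereas a scale-invariant prediction needs only a single scale factor, so its 3D structure is already correct up to a global scale. The snippet below is a minimal NumPy sketch of the two alignment procedures on toy data; it is an illustration, not the paper's evaluation code.

import numpy as np

def align_scale_shift(pred, gt, mask):
    """Least-squares fit of gt ~ s * pred + t over valid pixels (affine-invariant case)."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)          # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def align_scale_only(pred, gt, mask):
    """Least-squares fit of gt ~ s * pred over valid pixels (scale-invariant case)."""
    p, g = pred[mask], gt[mask]
    s = (p * g).sum() / (p * p).sum()
    return s * pred

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(0.1, 1.0, size=(256, 512))          # toy ground-truth disparity
    mask = gt > 0
    pred_si = 2.0 * gt                                    # off by a global scale only
    pred_ssi = 2.0 * gt + 0.3                             # off by scale and shift
    mae = lambda p: np.abs(p[mask] - gt[mask]).mean()
    print("scale-only alignment of SI prediction :", mae(align_scale_only(pred_si, gt, mask)))
    print("scale+shift alignment of SSI prediction:", mae(align_scale_shift(pred_ssi, gt, mask)))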

Framework - Simple & Effective

[Figure: DA360 framework overview]
  • DA360 is initialized from a pinhole depth foundation model, Depth Anything V2, and fine-tuned on synthetic panoramic depth datasets under scale-invariant supervision in disparity space.
  • The class token is used to predict a shift value that converts the affine-invariant disparity into scale-invariant disparity.
  • The DPT decoder is augmented with circular padding to ensure seamless results at the panorama boundary (see the sketch after this list).
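Below is a minimal PyTorch sketch of the two adaptations. The module names, the head architecture, and the exact way the predicted shift is applied (here simply added to the affine-invariant disparity) are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftHead(nn.Module):
    """Predicts a per-image shift from the ViT class token so that the
    affine-invariant disparity becomes scale-invariant (assumed form: d_si = d_ssi + shift)."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 2),
            nn.GELU(),
            nn.Linear(embed_dim // 2, 1),
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # cls_token: [B, C] -> shift: [B, 1, 1], broadcastable over H x W
        return self.mlp(cls_token).unsqueeze(-1)

def circular_conv2d(x: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """Apply a convolution with circular padding along the width (longitude) axis and
    zero padding along the height axis, so the left/right panorama edges meet seamlessly."""
    ph, pw = conv.kernel_size[0] // 2, conv.kernel_size[1] // 2
    x = F.pad(x, (pw, pw, 0, 0), mode="circular")    # wrap horizontally
    x = F.pad(x, (0, 0, ph, ph), mode="constant")    # zero-pad vertically
    return F.conv2d(x, conv.weight, conv.bias, stride=conv.stride)

if __name__ == "__main__":
    cls_tokens = torch.randn(2, 768)                 # toy class tokens from the ViT backbone
    disparity_ssi = torch.rand(2, 256, 512)          # toy affine-invariant disparity
    disparity_si = disparity_ssi + ShiftHead()(cls_tokens)
    conv = nn.Conv2d(16, 16, kernel_size=3, padding=0)
    out = circular_conv2d(torch.randn(2, 16, 64, 128), conv)
    print(disparity_si.shape, out.shape)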

Scale-Invariant & Seamless Results

[Figure: point cloud comparison of DA360, Depth Anything V2, and PanDA]
DA360 estimates accurate scale-invariant disparity from panoramic images, which can be directly converted into well-structured 3D point clouds. In contrast, neither Depth Anything V2 (predicting affine-invariant disparity) nor PanDA (predicting affine-invariant depth) achieves this capability. Moreover, DA360 generates point clouds without seam artifacts at the panoramic boundaries.
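The reason scale-invariant disparity suffices for point clouds: in an equirectangular image every pixel corresponds to a known ray direction on the unit sphere, so multiplying each ray by depth (the reciprocal of disparity) yields a 3D point that is correct up to one global scale. The sketch below back-projects a disparity map under an assumed pixel-to-angle convention and axis orientation; it is illustrative, not taken from the paper's code.

import numpy as np

def equirect_to_pointcloud(disparity: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """disparity: [H, W] scale-invariant disparity (inverse depth).
    Returns an [H*W, 3] point cloud, valid up to a single global scale."""
    H, W = disparity.shape
    depth = 1.0 / np.clip(disparity, eps, None)

    # Pixel centers -> longitude in [-pi, pi), latitude in (-pi/2, pi/2).
    u = (np.arange(W) + 0.5) / W            # [0, 1)
    v = (np.arange(H) + 0.5) / H            # [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi           # azimuth
    lat = (0.5 - v) * np.pi                 # elevation (top row near +pi/2)
    lon, lat = np.meshgrid(lon, lat)        # each [H, W]

    # Unit ray directions on the sphere, scaled by (relative) depth.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    points = depth[..., None] * np.stack([x, y, z], axis=-1)    # [H, W, 3]
    return points.reshape(-1, 3)

if __name__ == "__main__":
    disp = np.random.uniform(0.1, 1.0, size=(256, 512)).astype(np.float32)
    pts = equirect_to_pointcloud(disp)
    print(pts.shape)    # (131072, 3)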

Comparison on Aerial Video

Comparison on 20 ODV360 Video Clips

BibTeX

@article{jiang2025depth,
  title={Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild},
  author={Jiang, Hualie and Song, Ziyang and Lou, Zhiqiang and Xu, Rui and Tan, Minglang},
  journal={arXiv preprint arXiv:2512.22819},
  year={2025}
}