DEFOM-Stereo: Depth Foundation Model Based Stereo Matching

Benchmarking Results

Comparison with Previous SOTA Methods

[Figure panels: KITTI 2012 (ViTAStereo vs. DEFOM-Stereo); Middlebury (Selective-IGEV vs. DEFOM-Stereo); ETH3D (LoS vs. DEFOM-Stereo)]

Top Ranks on Stereo Leaderboards

[Figure panels: leaderboards for Middlebury, KITTI 2015, KITTI 2012, and ETH3D]

Abstract

Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges such as occlusion and textureless regions hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. To facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. Experiments show that DEFOM-Stereo has much stronger zero-shot generalization than SOTA methods. Moreover, DEFOM-Stereo achieves top performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking $1^{st}$ on many metrics. In the joint evaluation under the Robust Vision Challenge, our model simultaneously outperforms previous models on the individual benchmarks, further demonstrating its outstanding capabilities.

Pipeline

We propose a novel recurrent stereo-matching framework incorporating monocular depth cues from a depth foundation model to improve robustness.
We develop a simple technique that uses pre-trained DEFOM features to construct stronger combined matching-feature and context encoders.
We introduce a recurrent scale update module equipped with a scale lookup, which recovers accurate pixel-wise scales for the coarse DEFOM depth.
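
As a rough illustration of the scale lookup idea, the sketch below samples a 1D correlation volume at multiplicatively scaled versions of the current disparity, rather than at additive offsets around it. It is a NumPy toy under stated assumptions, not the authors' implementation: the function names `scale_lookup` and `toy_scale_update`, the candidate scale factors, and the argmax selection (a stand-in for the learned recurrent update) are all hypothetical.

```python
import numpy as np

def scale_lookup(corr, disp, scales):
    """Sample correlation at disparities scaled by each candidate factor.

    corr:   (H, W, W) array; corr[y, x, x2] is the similarity between
            left pixel (y, x) and right pixel (y, x2).
    disp:   (H, W) current disparity, e.g. initialized from monocular depth.
    scales: (K,) candidate multiplicative factors (hypothetical values).
    Returns an (H, W, K) array sampled at x - scales[k] * disp.
    """
    H, W, W2 = corr.shape
    xs = np.arange(W, dtype=np.float64)[None, :, None]       # (1, W, 1)
    target = xs - disp[:, :, None] * scales[None, None, :]   # (H, W, K)
    target = np.clip(target, 0.0, W2 - 1.0)
    # Linear interpolation along the right-image x axis.
    x0 = np.floor(target).astype(np.int64)
    x1 = np.minimum(x0 + 1, W2 - 1)
    w1 = target - x0
    c0 = np.take_along_axis(corr, x0, axis=2)
    c1 = np.take_along_axis(corr, x1, axis=2)
    return (1.0 - w1) * c0 + w1 * c1

def toy_scale_update(corr, disp, scales):
    """Pick the best-matching scale per pixel (assumption: argmax replaces
    the learned recurrent update) and rescale the disparity with it."""
    sampled = scale_lookup(corr, disp, scales)
    best = np.argmax(sampled, axis=2)                         # (H, W)
    return disp * scales[best]

# Synthetic example: true disparity is 4 px, the initial guess is 2 px.
H, W, true_disp = 3, 16, 4.0
x = np.arange(W, dtype=np.float64)
corr = -np.abs(x[None, None, :] - (x[None, :, None] - true_disp))
corr = np.repeat(corr, H, axis=0)                             # (H, W, W)
init_disp = np.full((H, W), 2.0)
scales = np.array([0.5, 1.0, 2.0, 4.0])
updated = toy_scale_update(corr, init_disp, scales)
```

In this toy setup the initial disparity has the right structure but the wrong scale, which is the situation a relative-depth initialization would create; away from the left image border, the scaled lookup selects the factor 2 and recovers the 4 px disparity.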

Ablation Study

Ablation study of the proposed components on the Scene Flow test set and on zero-shot generalization. The baseline is RAFT-Stereo with two levels of correlation pyramids. The parameters counted here are the trainable ones. The time is the inference time for 960$\times$540 inputs. †We found that pre-defining the neighbor sampling indexes within the search radius, instead of redefining them in every lookup as RAFT-Stereo's implementation does, significantly accelerates inference. We also apply this trick to the baseline; otherwise, its inference time would be 0.329s.
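
The pre-defined sampling indexes mentioned in the footnote can be sketched as follows. This is a minimal NumPy stand-in, assuming a single-level 1D correlation volume and a hypothetical class name `CorrLookup1D`; it only shows the structure of the trick (offsets built once in the constructor and reused by every lookup) and does not reproduce the reported timings or the authors' PyTorch code.

```python
import numpy as np

class CorrLookup1D:
    """Correlation lookup whose neighbor offsets are defined once."""

    def __init__(self, radius):
        # Built a single time and reused by every call, instead of being
        # re-created inside each lookup.
        self.offsets = np.arange(-radius, radius + 1, dtype=np.float64)

    def __call__(self, corr, disp):
        """corr: (H, W, W) volume, disp: (H, W) disparity.
        Returns (H, W, 2 * radius + 1) samples around x - disp."""
        H, W, W2 = corr.shape
        centers = np.arange(W, dtype=np.float64)[None, :, None] - disp[:, :, None]
        target = np.clip(centers + self.offsets, 0.0, W2 - 1.0)
        # Linear interpolation along the right-image x axis.
        x0 = np.floor(target).astype(np.int64)
        x1 = np.minimum(x0 + 1, W2 - 1)
        w1 = target - x0
        c0 = np.take_along_axis(corr, x0, axis=2)
        c1 = np.take_along_axis(corr, x1, axis=2)
        return (1.0 - w1) * c0 + w1 * c1

# Synthetic volume peaking at a disparity of 4 px.
H, W = 2, 16
x = np.arange(W, dtype=np.float64)
corr = -np.abs(x[None, None, :] - (x[None, :, None] - 4.0))
corr = np.repeat(corr, H, axis=0)

lookup = CorrLookup1D(radius=2)          # offsets created here, once
disp = np.full((H, W), 4.0)
for _ in range(3):                       # repeated lookups reuse the offsets
    window = lookup(corr, disp)
```

In a recurrent framework the lookup runs at every update iteration, so hoisting the index construction out of the call removes repeated work from the inner loop.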

Robust Vision Challenge

Quantitative Results

Visual Comparison

[Figure panels: KITTI 2015 (UCFNet_RVC vs. DEFOM-Stereo_RVC); Middlebury, 2 examples (CREStereo++_RVC vs. DEFOM-Stereo_RVC); ETH3D, 2 examples (LoS_RVC vs. DEFOM-Stereo_RVC)]

Zero-shot Comparison on Benchmark Datasets

Quantitative Results

Visual Comparison

[Figure panels: Middlebury (full), 3 examples; Middlebury (half), 2 examples; ETH3D, 2 examples; KITTI 2012, 2 examples; KITTI 2015, 2 examples. Each example compares RAFT-Stereo, Mocha-Stereo, Selective-IGEV, DEFOM-Stereo (ViT-S), and DEFOM-Stereo (ViT-L).]

BibTeX

@misc{jiang2025defomstereo,
    title={DEFOM-Stereo: Depth Foundation Model Based Stereo Matching},
    author={Hualie Jiang and Zhiqiang Lou and Laiyan Ding and Rui Xu and Minglang Tan and Wenjie Jiang and Rui Huang},
    year={2025},
    eprint={2501.09466},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
