Depth Estimation From a Single RGB Image: State of 2025
MiDaS, ZoeDepth, Depth Anything, and Marigold: benchmarks, architectures, and the stubborn failure modes that separate the lab from the field.
Monocular depth estimation in 2025 is dominated by four families: MiDaS 3.1 (a strong relative-depth baseline), ZoeDepth (metric, hybrid indoor-outdoor), Depth Anything V2 (scaled data, best zero-shot), and Marigold (diffusion-based, sharpest edges). Depth Anything V2 leads most NYU-v2 and KITTI leaderboards. The stubborn failure modes are transparent surfaces, thin structures, and metric calibration on novel cameras.
Single-image depth estimation has moved from "interesting research problem" to "deployable building block" in the eleven years since Eigen, Puhrsch and Fergus (NeurIPS 2014) first trained a CNN to predict depth from an RGB image. The 2025 picture is busier than the 2020 picture, but the problem statement has not changed: given one RGB frame and no scene priors, produce a per-pixel depth map that is consistent with the world.
This review walks through the four model families that matter in 2025, their benchmark numbers on NYU Depth V2 and KITTI, the architectural choices that distinguish them, and the failure modes that continue to stop monocular depth from replacing stereo and LiDAR in safety-critical pipelines. We close with a brief note on applied transfer — specifically, why depth estimation has become the enabling primitive for photo-based portion sizing in consumer food tracking.
Why monocular depth is hard
The task is underdetermined. Any RGB image is consistent with an infinite family of 3D scenes, because a small object close to the camera produces the same projection as a large object far from the camera. Humans resolve this ambiguity using learned priors: we know roughly how big a coffee mug is, we know that floors tend to be flat, we know that a shadow implies a surface. Monocular depth networks succeed to the extent they internalize the same priors from training data.
That is why data scale matters so much. Depth Anything V2's principal contribution is not an architectural innovation but a pipeline for scaling unlabeled image data through teacher-student pseudo-labeling. The architecture is, by 2025 standards, conservative — a DINOv2 ViT backbone with a DPT-style prediction head.
The four model families
MiDaS 3.1 (Intel/ISL, 2023)
MiDaS (Ranftl et al., TPAMI 2022) is the workhorse. The 3.1 release, published on the Intel ISL GitHub in late 2023, added BEiT-large and Swin2-L backbone variants and pushed AbsRel on NYU-v2 to 0.074 with the BEiT-L/512 configuration. MiDaS predicts relative depth (inverse depth up to scale and shift), which is what most downstream tasks actually need. The training recipe combines twelve datasets with a scale-and-shift-invariant loss, which is the trick that made zero-shot transfer work in the first place.
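The scale-and-shift-invariant idea can be made concrete in a few lines: before comparing a relative prediction against ground truth, solve a two-parameter least-squares fit. Below is a minimal NumPy sketch of the alignment step, not MiDaS's exact training loss (which operates on trimmed inverse depth), just the core fit:

```python
import numpy as np

def align_scale_shift(pred, gt, mask=None):
    """Least-squares fit of scale s and shift t so that s*pred + t ~= gt.

    pred, gt: arrays of the same shape (e.g. inverse-depth maps).
    mask: optional boolean array selecting valid pixels.
    Returns the aligned prediction.
    """
    p = pred[mask] if mask is not None else pred.ravel()
    g = gt[mask] if mask is not None else gt.ravel()
    # Solve [p 1] @ [s, t]^T = g in the least-squares sense.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t
```

The same alignment is what evaluation scripts apply before scoring a relative-depth model against metric ground truth, which is why a model can top an affine-invariant leaderboard without ever knowing what a meter is.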
MiDaS-small, the 21M-parameter variant, is the on-device favorite. It runs at roughly 30 FPS at 256x256 on an iPhone 15 Pro through Core ML and at roughly 18 FPS on a Pixel 8 through ONNX Runtime Mobile with INT8 quantization.
ZoeDepth (2023)
ZoeDepth (Bhat et al., 2023) addresses the metric-depth problem by chaining a MiDaS-style relative-depth backbone with two metric heads — one trained on indoor data (NYU-v2) and one on outdoor data (KITTI) — plus a classifier that decides which head to use at inference. The practical appeal is that you get metric depth (meters) without retraining a backbone per domain.
ZoeDepth's weakness is the classifier. When indoor and outdoor scenes appear in the same image — a shop window, a patio door — the classifier mis-routes, and depth discontinuities appear along the classification boundary. The model is still a frequent choice for applications where the domain is known in advance and the indoor/outdoor classifier can be pinned.
Depth Anything V2 (2024)
Depth Anything V2 (Yang et al., CVPR 2024) is the 2025 front-runner on most zero-shot benchmarks. The architectural story is DINOv2 + DPT, which by itself is not novel. The story that matters is the data pipeline: the authors trained a teacher on high-quality synthetic data (Hypersim, Virtual KITTI 2), then used that teacher to label 62 million real images, then trained the student on the combination. The result is a family of models (Small 25M, Base 97M, Large 335M, Giant 1.3B) with a strong quality-to-size curve.
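The pipeline is worth seeing in miniature. The toy sketch below is not the paper's code; it is only the three-step pattern the paper describes, with a linear least-squares fit standing in for a depth network:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares 'model' standing in for a depth network."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Ground-truth mapping that the "synthetic renderer" follows exactly.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])

# Step 1: train the teacher on a small, perfectly labeled synthetic set.
X_syn = rng.normal(size=(50, 3))
y_syn = X_syn @ w_true
teacher = fit_linear(X_syn, y_syn)

# Step 2: the teacher pseudo-labels a much larger unlabeled "real" set.
X_real = rng.normal(size=(5000, 3))
y_pseudo = X_real @ teacher

# Step 3: the student trains on synthetic plus pseudo-labeled data.
X_all = np.vstack([X_syn, X_real])
y_all = np.concatenate([y_syn, y_pseudo])
student = fit_linear(X_all, y_all)
```

In the real pipeline the payoff is coverage, not accuracy of the labels: the 62M pseudo-labeled images expose the student to far more of the visual world than any labeled depth dataset could.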
Depth Anything V2 Metric (a separate release) addresses the metric problem by fine-tuning on specific metric datasets. It is not as domain-general as the relative model but beats ZoeDepth on indoor metric AbsRel by roughly 8-12% depending on configuration.
Marigold (CVPR 2024)
Marigold (Ke et al., CVPR 2024) is the most architecturally interesting of the four. The authors start from Stable Diffusion V2 and fine-tune it as a conditional depth generator: the RGB image conditions the denoising process, and the output is a depth map rather than an RGB image. Fine-tuning takes a few days on a single 8xA100 node, because Stable Diffusion has already learned an excellent visual prior.
The qualitative result is striking — Marigold produces the sharpest depth boundaries of any model we tested, particularly on fine structure (hair, foliage, wires) that MiDaS and Depth Anything consistently over-smooth. The cost is latency: a 10-step DDIM schedule takes roughly 0.8 seconds on an A100 at 768x768, and the paper recommends 10-50 ensemble passes for best results.
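The ensembling step can be sketched outside the diffusion loop: align each stochastic sample to a common reference with a per-sample scale and shift, then take the pixelwise median. A simplified NumPy version follows; the official repository optimizes the alignment jointly, and aligning everything to the first sample is a simplification made here for brevity:

```python
import numpy as np

def ensemble_depth(preds):
    """Merge N affine-invariant depth samples into one map.

    preds: array of shape (N, H, W); each sample is valid only up to
    an unknown scale and shift. Aligns every sample to the first via
    least squares, then takes the pixelwise median.
    """
    ref = preds[0].ravel()
    aligned = []
    for p in preds:
        A = np.stack([p.ravel(), np.ones(p.size)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
        aligned.append(s * p + t)
    return np.median(np.stack(aligned), axis=0)
```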
Benchmark numbers
Numbers below are as reported in each method's original paper or official release, measured on NYU Depth V2 (indoor) and KITTI Eigen split (outdoor). Lower AbsRel and RMSE are better; higher delta-1.25 is better.
| Model | Params | NYU AbsRel | NYU δ<1.25 | KITTI AbsRel | KITTI RMSE |
|---|---|---|---|---|---|
| MiDaS 3.1 (BEiT-L/512) | 345M | 0.074 | 0.941 | 0.090 | 3.42 |
| MiDaS-small | 21M | 0.119 | 0.861 | 0.142 | 5.18 |
| ZoeDepth (M12) | 335M | 0.075 | 0.955 | 0.057 | 2.28 |
| Depth Anything V1 (L) | 335M | 0.056 | 0.984 | 0.046 | 2.07 |
| Depth Anything V2 (L) | 335M | 0.045 | 0.979 | 0.074 | 2.54 |
| Marigold (ensemble=10) | ~865M | 0.055 | 0.964 | 0.099 | n/a* |
Table 1. Monocular depth estimation benchmarks on NYU Depth V2 and KITTI Eigen split. *Marigold is evaluated primarily on affine-invariant metrics; RMSE in meters requires a separate metric head. Figures per original publications.
A few observations. First, Depth Anything V2 does not always beat V1 on KITTI; the V2 paper is explicit that V2 trades some outdoor performance for much better fine-structure preservation. Second, ZoeDepth remains competitive on KITTI despite being older, because its outdoor head was trained specifically on KITTI. Third, MiDaS-small at 21M parameters is the best option when you need sub-100ms inference on a phone.
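For readers reproducing the table, the three metrics reduce to a few lines each. A NumPy sketch using the standard Eigen-split definitions; official evaluation scripts additionally apply depth caps and evaluation crops:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, RMSE, and delta<1.25 over valid (gt > 0) pixels."""
    m = gt > 0
    p, g = pred[m], gt[m]
    abs_rel = np.mean(np.abs(p - g) / g)      # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))     # root mean squared error, meters
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)            # fraction within 25% of gt
    return abs_rel, rmse, delta1
```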
A reference inference snippet
Depth Anything V2 through HuggingFace Transformers, for reproducibility. This is the pattern most 2025 production codepaths start from; swap the model ID for MiDaS or ZoeDepth weights as needed.
```python
import torch
from transformers import pipeline
from PIL import Image

# Depth Anything V2 Large, via the HuggingFace depth-estimation pipeline
pipe = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Large-hf",
    device=0 if torch.cuda.is_available() else -1,
)

image = Image.open("scene.jpg")
out = pipe(image)

# out["predicted_depth"] is a torch.Tensor of relative (affine-invariant) depth
# out["depth"] is a PIL.Image visualization
# For metric depth on a known domain (indoor), swap to:
# model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf"
```

Transfer to novel domains
The stated strength of the 2024-2025 generation is zero-shot transfer. In practice, transfer is strongest when the target domain resembles the training prior. Depth Anything V2 was trained primarily on natural images (ImageNet-scale distributions), which means it does well on street scenes, rooms, landscapes, and faces, and less well on microscopy, medical imaging, and tightly-cropped close-ups of small objects.
One domain worth flagging for applied readers: close-up photography of food on a plate. This is an unusual monocular-depth setup — the camera is typically 20-40cm above the subject, the object sizes are small, and the reference surface (the plate) provides a useful but not always reliable scale cue. Depth estimation is the key enabler for photo-based portion sizing in food tracking, because once you have a depth map plus a plate-anchored scale, you can reconstruct the volume of each recognized food item. The vertical geometry is unfamiliar to most off-the-shelf depth models, which means practitioners typically fine-tune on domain-specific data. We will return to this problem in a future piece dedicated to food-volume reconstruction.
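The plate-anchored idea can be sketched under strong simplifying assumptions: a top-down camera, a flat plate used both as the reference plane and as the metric anchor, and a depth map already in metric units. Real pipelines fit the plate plane with RANSAC and handle perspective and occlusion; the function and masks below are illustrative only:

```python
import numpy as np

def food_volume_m3(depth, plate_mask, food_mask, plate_diameter_m):
    """Toy plate-anchored volume estimate from a top-down depth map.

    depth: (H, W) metric depth, camera looking straight down.
    plate_mask: boolean mask of visible plate-surface pixels.
    food_mask: boolean mask of the food item.
    plate_diameter_m: known plate diameter, the metric scale anchor.
    """
    # Meters per pixel, from the plate's horizontal pixel extent.
    cols = np.where(plate_mask.any(axis=0))[0]
    plate_diameter_px = cols.max() - cols.min() + 1
    m_per_px = plate_diameter_m / plate_diameter_px

    # Plate surface as the reference plane (flat-plate simplification).
    plate_depth = np.median(depth[plate_mask])

    # Food rises toward the camera, so it has smaller depth than the plate.
    height = np.clip(plate_depth - depth, 0, None)
    return float(np.sum(height[food_mask]) * m_per_px ** 2)
```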
Remaining failure modes
Three failure modes are the ones that actually cost production teams time in 2025.
Transparent and reflective surfaces. A mirror or a plate-glass window violates the one-pixel-one-surface assumption that underlies every depth model. Current models generally predict the depth of whatever is visible through the glass, which is almost always wrong. There is no good fix without an auxiliary segmentation signal.
Thin structures. Wires, fences, window mullions, and hair all get over-smoothed by MiDaS and Depth Anything. Marigold is noticeably better here, at latency cost.
Metric calibration across cameras. Metric depth requires knowing the camera intrinsics — at minimum, the focal length. Models that ignore intrinsics (most relative-depth models) cannot produce true metric output without a separate calibration step. This is why applications that need metric output either constrain the camera (fixed phone, known lens) or estimate intrinsics jointly.
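The dependence on intrinsics is just the pinhole model: a pixel (u, v) with depth z back-projects to X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z. A minimal sketch, assuming an undistorted image:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a metric depth map to an (H, W, 3) point cloud
    in camera coordinates, using the pinhole model."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

Note that a wrong focal length rescales X and Y but leaves Z untouched, which is exactly why metric output degrades silently when a model meets a camera it was not calibrated for.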
Updated 2026
Depth Anything V2 remains a strong default in 2026, but the landscape has shifted in two ways. First, Marigold-LCM and subsequent distilled diffusion variants have collapsed the 10-step ensemble cost to a single-step inference, making diffusion-based depth competitive on latency. Second, consumer applications have started shipping domain-fine-tuned depth models — most visibly in food tracking, where we cover one specific approach in our food recognition technical overview.
Conclusion
Single-image depth estimation in 2025 is better than most practitioners realize. Depth Anything V2 at Large size matches the quality of stereo methods from three years ago, runs in under 50ms on an A100, and transfers zero-shot across most natural-image domains. Marigold's diffusion-based approach validates a direction that will probably dominate the next generation. The remaining open problems — transparency, thin structure, and metric calibration — are real but bounded, and the techniques for working around them (segmentation-guided re-prediction, domain fine-tuning, intrinsics estimation) are well understood.
Frequently asked questions
What is monocular depth estimation?
Monocular depth estimation predicts a per-pixel depth map from a single RGB image, without stereo pairs or active sensors. Modern methods train on large datasets of RGB-depth pairs captured by Kinect, LiDAR, or structure-from-motion reconstructions.
Is MiDaS 3.1 still state of the art in 2025?
No. MiDaS 3.1 is a strong baseline, but Depth Anything V2 and Marigold produce sharper and more consistent predictions on zero-shot benchmarks. MiDaS-small survives as a popular on-device option because of its favorable parameter-to-accuracy ratio.
What is the difference between relative and metric depth?
Relative depth predicts ordering up to unknown scale. Metric depth predicts absolute distance in meters. Metric depth is harder to generalize across cameras because focal length matters. ZoeDepth and Depth Anything V2 Metric both attempt to unify the two.
Why are diffusion-based depth models interesting?
Marigold showed that repurposing a pretrained Stable Diffusion U-Net as a depth estimator yields exceptionally sharp predictions on unseen scenes. The approach validates the idea that large generative priors transfer well to perception tasks.
Can monocular depth replace LiDAR in self-driving?
Not for safety-critical metric ranging. Monocular depth is reliable for relative geometry and as a dense prior complementing sparse LiDAR. Metric accuracy degrades with distance and on uncommon surfaces.
What is the NYU Depth V2 benchmark?
NYU Depth V2 (Silberman et al., ECCV 2012) is an indoor RGB-D dataset captured with Microsoft Kinect. It contains 1,449 densely labeled pairs from 464 scenes and is the canonical indoor monocular depth benchmark.
Can depth estimation be deployed on a phone in 2025?
Yes, for moderate resolutions. MiDaS-small and Depth Anything Small run at interactive rates on recent iPhones and Pixel phones via Core ML or ONNX Runtime Mobile, typically at 256x256 or 384x384 with INT8 quantization.
What is the hardest failure mode for monocular depth?
Transparent and reflective surfaces. Glass and mirrors violate the single-surface-per-pixel assumption. Models typically hallucinate depth from whatever is visible through or reflected in the surface, which is almost always wrong.