Introducing a Class-Aware Metric for Monocular Depth Estimation:
An Automotive Perspective

1Dr. Ing. h.c. F. Porsche AG, Stuttgart, Germany
2Porsche Engineering Group GmbH, Weissach, Germany
3Institute for Applied AI, Stuttgart Media University, Stuttgart, Germany
4University of Freiburg, Freiburg, Germany
5Ulm University, Ulm, Germany

Overview

The increasing accuracy reported for metric monocular depth estimation (MMDE) models has led to growing interest from the automotive domain. However, current model evaluations provide little insight into model performance, particularly with respect to safety-critical or unseen classes.

We propose a novel metric built from three components: a class-wise component, an edge- and corner-based image feature component, and a global consistency component. Classes are further weighted by their distance within the scene and by their criticality for automotive applications.

In our evaluation, we demonstrate the benefits of the metric through comparison with classical metrics, class-wise analysis, and the retrieval of critical situations. The results show that our metric provides deeper insights into model results while fulfilling safety-critical requirements.

Gallery


The top row shows the original image and its segmentation mask from the German Outdoor and Offroad Dataset (GOOSE). The bottom row presents the depth maps predicted by the highest-ranking models in our evaluation, according to our metric (left) and the Mean Absolute Error (MAE, right).

How it works

We introduce a novel depth estimation metric designed for comprehensive scene evaluation. The metric operates at three distinct levels of granularity, each captured by a dedicated component:

  1. Class-Based Component \( E_{\text{class}} \): Enables insights into the model's performance across a variety of classes, including potential out-of-distribution classes.

  2. Feature Component \( E_{\text{feature}} \): Employs techniques such as edge or corner detection filters to evaluate the model's ability to accurately represent object features.

  3. Global Consistency Component \( E_{\text{global}} \): Integrates standard depth estimation evaluation methods to ensure overall consistency.

Although the individual weights can depend on the specific scenario, we propose the overall combination of components as

\[ L = \gamma_1 \cdot E_{\text{class}} + \gamma_2 \cdot E_{\text{feature}} + \gamma_3 \cdot E_{\text{global}} \]

with \( \gamma_1 = \gamma_2 = \gamma_3 = 1 \), allowing a near-metric offset evaluation while incorporating the class and distance weightings.
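
As a minimal sketch, assuming the three component errors have already been computed, the combination reduces to a weighted sum; the function name, signature, and defaults below are illustrative, not the authors' reference code:

```python
def combined_metric(e_class: float, e_feature: float, e_global: float,
                    gammas: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum L = g1 * E_class + g2 * E_feature + g3 * E_global."""
    g1, g2, g3 = gammas
    return g1 * e_class + g2 * e_feature + g3 * e_global
```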



1. Class-Based Component

This component measures the metric depth error for each object class individually, such as cars, trucks, buildings, and poles. This provides detailed insight into how different models handle the various object classes and improves the understanding of model performance in uncommon or previously unseen scenarios.


Intra-Class Weighting

The importance of a class can vary strongly between frames and situations. Since we operate on classification masks rather than instance masks, a single mask may encompass multiple car instances at varying distances within the scene. Treating these instances identically across different scenes complicates the interpretation of the metric; weighting within each class is therefore necessary.

Consequently, we propose a distance-based intra-class weight \( w_{\text{dist}} \), computed from the distances within each scene. We define it as:

\[ w_{\text{dist}} = \frac{d_{\text{class}} - \min(D_{\text{classes}})}{\max(D_{\text{classes}}) - \min(D_{\text{classes}})} \]

\[ \text{with } d_{\text{class}} = d_{\text{scene-max}} - d_{\text{class-min}} \]

where \( d_{\text{scene-max}} \) denotes the maximum distance within the entire scene, \( d_{\text{class-min}} \) the minimum distance within a class, and \( D_{\text{classes}} \) the set of \( d_{\text{class}} \) values over all classes present in the scene.
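
A minimal sketch of this weighting in Python, assuming the distances are taken from the ground-truth depth map and that each class is given as a boolean pixel mask (both assumptions on our part):

```python
import numpy as np

def intra_class_weights(depth: np.ndarray, masks: dict[str, np.ndarray]) -> dict[str, float]:
    """Distance-based intra-class weights w_dist, following the formulas above."""
    d_scene_max = depth.max()
    # d_class = d_scene-max - d_class-min: classes closer to the camera get larger values
    d_class = {c: d_scene_max - depth[m].min() for c, m in masks.items() if m.any()}
    lo, hi = min(d_class.values()), max(d_class.values())
    span = (hi - lo) or 1.0  # guard against scenes containing a single class
    return {c: (d - lo) / span for c, d in d_class.items()}
```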


Automotive Inter-Class Weighting

Since class importance depends heavily on the use case at hand, the specific weighting of the classes can be chosen individually. As our focus is the use of MMDE models in automotive applications, and in automotive safety in particular, we provide a corresponding in-depth weight setup.
To this end, we leverage accident data and use the distribution of first accident opponents, sourced from the German In-Depth Accident Study (GIDAS) database. The following table shows this distribution, which we use to weight the class importance.

Main Class       Sub Class                  Distribution
Car-to-Vehicle                                    62.06%
                 Car                              50.04%
                 Motorcycle                        7.38%
                 Truck & Van & Bus                 3.73%
                 Trains                            0.63%
                 Other Motorized Vehicle           0.27%
Car-to-VRU                                        30.00%
                 Bicycles                         21.95%
                 Pedestrian                        8.05%
Car-to-Object                                      7.94%
                 Pole / Tree                       3.24%
                 Guardrail                         1.17%
                 Ditch / Embankment                1.07%
                 Road / Terrain                    1.04%
                 Other Object                      0.75%
                 Wall / Bridge                     0.56%
                 Bush / Fence                      0.11%
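
One straightforward way to turn this distribution into inter-class weights \( w_{\text{class}} \) is to normalize the sub-class shares so they sum to one. The sketch below assumes this normalization; the class names are ours and must be mapped to the label set of the dataset at hand (e.g., GOOSE):

```python
# GIDAS first-accident-opponent shares (in %) from the table above.
GIDAS_DISTRIBUTION = {
    "car": 50.04, "motorcycle": 7.38, "truck_van_bus": 3.73, "train": 0.63,
    "other_motorized_vehicle": 0.27, "bicycle": 21.95, "pedestrian": 8.05,
    "pole_tree": 3.24, "guardrail": 1.17, "ditch_embankment": 1.07,
    "road_terrain": 1.04, "other_object": 0.75, "wall_bridge": 0.56,
    "bush_fence": 0.11,
}
_total = sum(GIDAS_DISTRIBUTION.values())
W_CLASS = {name: share / _total for name, share in GIDAS_DISTRIBUTION.items()}
```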


Component Result

The final class-based component is calculated using MAE, the intra-class weight \( w_{\text{dist}} \), and the inter-class weight \( w_{\text{class}} \).

\[ E_{\text{class}} = \sum_{c=1}^{C} w_{\text{class}} \cdot w_{\text{dist}} \cdot \text{MAE}(I_c) \]

where \( I_c \) denotes the depth-map pixels belonging to class \( c \). This yields an error \( E_{\text{class}} \) that incorporates both how important a class is in general and how relevant it is in the given image situation.
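
Under the same assumptions as above (per-class boolean masks and precomputed weight dictionaries), a sketch of this component could look as follows; the helper names are ours:

```python
import numpy as np

def class_component(pred: np.ndarray, gt: np.ndarray,
                    masks: dict[str, np.ndarray],
                    w_class: dict[str, float],
                    w_dist: dict[str, float]) -> float:
    """E_class: per-class MAE weighted by inter- and intra-class weights."""
    e_class = 0.0
    for c, mask in masks.items():
        if mask.any() and c in w_class and c in w_dist:
            mae_c = float(np.abs(pred[mask] - gt[mask]).mean())  # MAE(I_c)
            e_class += w_class[c] * w_dist[c] * mae_c
    return e_class
```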


2. Local Feature Component

Another important factor for a high-quality depth map is the preservation of fine details in the prediction. These details serve multiple purposes, such as better differentiation between individual objects or capturing distinctive, and often relevant, shape changes such as trailer hitches or opened car doors.

To extract potentially relevant features, we apply several classical methods to the unmasked input image. We implement multiple corner detection algorithms, e.g., Harris, given the proven robustness of corner features for computer vision tasks such as feature matching. To further evaluate class-specific differences between the models under evaluation, we mask the edge depth map with the previously defined classes.

The importance of edge features depends on their distance to the capture point, so they are likewise scaled by \( w_{\text{dist}} \):

\[ E_{\text{feature}} = \sum_{c=1}^{C} w_{\text{class}} \cdot w_{\text{dist}} \cdot \text{MAE}(I_{cf}) \]

where \( I_{cf} \) denotes the feature (edge and corner) pixels belonging to class \( c \).
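
A sketch of this component using OpenCV's Canny edge and Harris corner detectors; the detector choice, parameters, and thresholds below are illustrative, not values from the paper:

```python
import cv2
import numpy as np

def feature_component(image: np.ndarray, pred: np.ndarray, gt: np.ndarray,
                      masks: dict[str, np.ndarray],
                      w_class: dict[str, float],
                      w_dist: dict[str, float]) -> float:
    """E_feature: depth MAE restricted to edge/corner pixels, per class."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200) > 0                     # edge pixels
    harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
    corners = harris > 0.01 * harris.max()                    # corner pixels
    features = edges | corners
    e_feature = 0.0
    for c, mask in masks.items():
        sel = mask & features                                 # I_cf: feature pixels of class c
        if sel.any() and c in w_class and c in w_dist:
            e_feature += w_class[c] * w_dist[c] * float(np.abs(pred[sel] - gt[sel]).mean())
    return e_feature
```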



3. Global Consistency Component

As we aim for a comprehensive evaluation, we further examine the global consistency of the generated depth map. This also covers situations in which no labels or masks are provided for certain objects, as well as global scaling issues not captured by the other components. For this, we simply calculate \( E_{\text{global}} \) as the MAE between the predicted and the ground-truth depth.
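
In code, this component reduces to a single MAE; the validity mask below (useful when the ground truth is sparse, e.g., from LiDAR) is our addition:

```python
import numpy as np

def global_component(pred: np.ndarray, gt: np.ndarray) -> float:
    """E_global: plain MAE between predicted and ground-truth depth."""
    valid = np.isfinite(gt) & (gt > 0)  # skip invalid or missing ground-truth pixels
    return float(np.abs(pred[valid] - gt[valid]).mean())
```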

Benchmark

We compare our metric against classical error metrics (MAE, RMSE, Abs-Rel; lower is better for all, including ours) on over 25 GOOSE dataset scenes. While both provide comprehensive insights into model performance, ours offers a more nuanced interpretation.

Model              Variant        MAE     RMSE    Abs-Rel   Ours
AdaBins            KITTI          13.30   25.21   0.33      20.65
DepthAnything V2   ViT-L           8.39   16.56   0.30      14.47
EcoDepth           -              10.25   20.51   0.28      17.43
Marigold           -              12.70   20.38   0.65      17.72
Metric3D V2        ViT-G2          6.47   14.44   0.20      11.57
PatchFusion        DA V1 ViT-L    15.05   24.33   0.55      23.32
UniDepth V1        ConvNeXt-L      8.26   16.70   0.24      14.19
UniDepth V2        ViT-L           8.57   20.00   0.27      14.24
ZoeDepth           NYU + KITTI     9.51   19.32   0.27      16.22



Citation

@article{ca_mmde,
  title={Introducing a Class-Aware Metric for Monocular Depth Estimation: An Automotive Perspective},
  author={Tim Bader and Leon Eisemann and Adrian Pogorzelski and Namrata Jangid and Attila-Balazs Kis},
  year={2024},
  url={https://arxiv.org/abs/2409.04086},
}