Multi Scale Neural Network And Feature Fusion for Monocular Depth Estimation

Depth estimation from monocular images is a challenging problem in computer vision. In this paper, we tackle this problem with a novel network architecture based on multi-scale feature fusion. Our network uses two different blocks. The first applies convolutions with different filter sizes and merges the resulting feature maps. The second uses dilated convolutions in place of fully connected layers, reducing computation and enlarging the receptive field. We present a new loss function for training the network that combines a depth regression term, an SSIM loss term and a multinomial logistic loss term. We train and test our network on the Make3D, NYU Depth V2 and Kitti datasets using standard evaluation metrics for depth estimation, namely RMSE and SILog. Our network outperforms previous state-of-the-art methods with fewer parameters.


Deep learning powered by neural networks has been successful in a range of computer vision problems. Making autonomous driving a reality requires solving the perception problem, which involves many sub-tasks such as object detection, instance segmentation, depth estimation and scene understanding. Neural networks try to mimic the human brain by learning from data without being explicitly programmed (Goodfellow et al., 2016). In this work, we tackle the depth estimation problem, especially in the context of autonomous driving.

Depth estimation is an important but complex problem in computer vision. It requires learning a function that computes the depth map from the input image. Humans have this ability naturally: the brain understands a scene using cues such as lighting, shading, perspective and the presence of objects of various sizes (Godard et al., 2017). For humans it is fairly easy to infer the distance of objects from a single image; for a computer, however, the task is quite challenging (Laina et al., 2016).

Stereo cameras have traditionally been used in Simultaneous Localization and Mapping (SLAM) systems, which gives those systems access to depth maps. However, a monocular camera offers benefits such as low power consumption, light weight and low cost, which makes it an attractive alternative. In the literature, depth estimation has mostly been tackled using stereo cameras (Rajagopalan et al., 2004). More recently, depth estimation from a single image or monocular camera has been tackled using a range of convolutional network architectures (Eigen et al., 2014; Laina et al., 2016; Liu et al., 2015a). The problem has been cast as a regression problem that uses the Mean Squared Error (MSE) in log space as the loss function.
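
As a rough illustration of the log-space MSE mentioned above, the following sketch computes the mean squared difference between log-depths. The function name `log_mse_loss` and the sample depths are hypothetical and are not taken from any of the cited papers.

```python
import math

def log_mse_loss(pred, target):
    """Mean squared error between log-depths.

    pred, target: equally long, non-empty lists of positive depths.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    squared = [(math.log(p) - math.log(t)) ** 2 for p, t in zip(pred, target)]
    return sum(squared) / len(squared)

# Hypothetical depths in metres, for illustration only.
near = log_mse_loss([2.0], [1.0])    # 1 m object predicted at 2 m
far = log_mse_loss([20.0], [10.0])   # 10 m object predicted at 20 m
```

Working in log space penalizes relative rather than absolute error, so `near` and `far` above come out equal (up to floating point rounding) even though the absolute errors differ by a factor of ten.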

Important Points

* We propose a novel end-to-end trainable network for monocular depth estimation.

* We present the network architecture, training details, loss functions and ablation studies.

* Our network outperforms previous state-of-the-art networks on the Make3D Range Image Data, NYU Depth Dataset V2 and Kitti datasets.


The following datasets have been used for training and testing our network:

  1. Make3D Range Image Data: This dataset was one of the first proposed for inferring the depth map from a single image. It provides the range data corresponding to each image. Examples from the dataset include outdoor scenes, indoor scenes and synthetic objects (Saxena et al., 2008).
  2. NYU Depth Dataset V2: This dataset is made up of video sequences from a variety of indoor scenes, recorded using both RGB and depth cameras. It has 1449 densely labelled pairs of aligned RGB and depth images. The objects present in the dataset have been individually labelled with a class id (Silberman et al., 2012). The official split consists of 249 training and 215 testing scenes. The images have a resolution of 480×640.
  3. Kitti dataset: This large dataset has over 93 thousand depth maps with corresponding raw Lidar scans and RGB images. It has been the benchmark dataset for single-image depth estimation in autonomous driving (Geiger et al., 2013). For benchmarking, the Eigen split was introduced by Eigen et al. (2014). The training set consists of approximately 22,600 frames from a total of 28 different scenes and the validation set contains 888 frames. The test set contains 697 frames from 28 different scenes. The images have a resolution of 376×1242.
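
The two evaluation metrics named in the abstract, RMSE and SILog, can be sketched on a predicted and a ground truth depth map as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, pixels with a ground truth of zero are assumed to carry no measurement (as with sparse Lidar depth), and SILog follows the scale-invariant log error of Eigen et al. (2014) without the constant scaling factor that some benchmark implementations apply.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error over pixels with valid ground truth."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    valid = gt > 0  # assumption: 0 marks pixels without a depth measurement
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

def silog(pred, gt):
    """Scale-invariant log error: standard deviation of the log-depth difference."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    valid = gt > 0
    d = np.log(pred[valid]) - np.log(gt[valid])
    variance = np.mean(d ** 2) - np.mean(d) ** 2
    return float(np.sqrt(max(variance, 0.0)))  # clamp tiny negative rounding errors

# Hypothetical 2×2 depth maps in metres; the zero is a missing measurement.
gt = np.array([[1.0, 2.0], [0.0, 4.0]])
pred = np.where(gt > 0, 2.0 * gt, 5.0)  # every valid depth is overestimated by 2x
```

Because `pred` differs from `gt` only by a global scale, `silog(pred, gt)` is essentially zero while `rmse(pred, gt)` is not, which is the property that makes the metric scale-invariant.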

Data Augmentation

Data augmentation is the process of artificially increasing the dataset size by applying transformations to the individual samples of the dataset. This improves generalization and helps to avoid overfitting of the network. Data augmentation has been used successfully for depth estimation (Alhashim and Wonka, 2018; Li et al., 2018). The training data was augmented as follows:

  • Scale: Colour images are scaled by a random number s ∈ [1, 1.5].
  • Rotation: The colour and depth images are both rotated by a random angle of r ∈ [−5, 5] degrees.
  • Colour Jitter: The brightness, contrast, and saturation of colour images are each scaled by k ∈ [0.6, 1.4].
  • Colour Normalization: RGB images are normalized through mean subtraction and division by standard deviation.
  • Flips: Colour and depth images are both horizontally flipped with a 50% chance. Nearest neighbour interpolation was also used.
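
Some of the augmentations listed above can be sketched with NumPy as follows. This is an illustrative sketch, not the original pipeline: the function name `augment` is hypothetical, the ImageNet channel statistics are an assumed choice for the normalization, the grey value used for saturation is a plain channel mean, and scaling and rotation are left out because they need an interpolation routine.

```python
import numpy as np

# Assumption: ImageNet channel statistics; the source does not say which were used.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def augment(rgb, depth, rng):
    """rgb: float array (H, W, 3) in [0, 1]; depth: float array (H, W)."""
    # Flip: mirror image and depth together with probability 0.5.
    if rng.random() < 0.5:
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    # Colour jitter: independent factors for brightness, contrast, saturation.
    b, c, s = rng.uniform(0.6, 1.4, size=3)
    rgb = np.clip(rgb * b, 0.0, 1.0)                    # brightness
    mean = rgb.mean()
    rgb = np.clip((rgb - mean) * c + mean, 0.0, 1.0)    # contrast
    grey = rgb.mean(axis=2, keepdims=True)
    rgb = np.clip((rgb - grey) * s + grey, 0.0, 1.0)    # saturation
    # Colour normalization: subtract the mean, divide by the standard deviation.
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD
    return rgb, np.ascontiguousarray(depth)

rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))         # hypothetical tiny image
depth = rng.random((4, 6)) + 1.0    # hypothetical depth map
aug_rgb, aug_depth = augment(rgb, depth, rng)
```

Geometric operations (the flip here, and rotation in the full list) are applied to the image and the depth map together so that the pixels stay aligned, while the colour operations touch only the image.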

Network Architecture

The task is to learn a direct mapping from a colour image to the corresponding depth map. Our network fuses multi-scale depth features, which is important for depth estimation. It contains no fully connected layers, which add considerable computational overhead. Although fully connected layers help in inferring long-range contextual information, they are not required. Instead, we use dilated convolutions, which enlarge the receptive field without increasing the number of parameters.
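
The claim that dilation enlarges the receptive field at no parameter cost can be checked with a little arithmetic. In the sketch below the helper names and the layer sizes (64 channels, an 8×8 feature map, 4096 fully connected units) are hypothetical and chosen only for illustration.

```python
def effective_kernel(k, dilation):
    """Side length of the area covered by a k×k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

def conv_params(k, c_in, c_out):
    """Weights plus biases of a k×k convolution; dilation does not appear."""
    return k * k * c_in * c_out + c_out

def fc_params(h, w, c_in, units):
    """Weights plus biases of a fully connected layer on a flattened h×w×c_in map."""
    return h * w * c_in * units + units

for d in (1, 2, 4):
    print(d, effective_kernel(3, d), conv_params(3, 64, 64))
# 1 3 36928
# 2 5 36928
# 4 9 36928
print(fc_params(8, 8, 64, 4096))  # 16781312
```

A 3×3 kernel with dilation 4 covers a 9×9 area with the same 36,928 parameters as the undilated kernel, whereas the hypothetical fully connected layer needs several hundred times as many.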

The network takes an image as input and uses a pre-trained ResNet backbone for feature extraction. Convolutions are applied at multiple scales using combinations of 1×1, 3×3, 5×5 and 7×7 kernels. An instance-wise concat operation is performed to merge the feature maps. This multi-scale block is repeated 4 times. The operation considerably increases the receptive field of our network, which can then capture global contextual information in addition to local information.
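
One way to picture a single multi-scale block is as four parallel convolutions whose outputs are stacked. The NumPy sketch below makes several assumptions that the text does not confirm: the concat is taken along the channel axis, each branch is followed by a ReLU, every branch has 4 output channels, and the names `conv2d_same` and `multi_scale_block` are hypothetical. The ResNet backbone is not included.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1, zero-padded convolution (cross-correlation, as in most
    deep learning frameworks). x: (C_in, H, W); w: (C_out, C_in, k, k), odd k."""
    k = w.shape[2]
    pad = k // 2
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) over input channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def multi_scale_block(x, weights):
    """Run every branch on the same input and concatenate along channels."""
    branches = [np.maximum(conv2d_same(x, w), 0.0) for w in weights]  # ReLU assumed
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))  # hypothetical 3-channel feature map
weights = [rng.standard_normal((4, 3, k, k)) * 0.1 for k in (1, 3, 5, 7)]
y = multi_scale_block(x, weights)
print(y.shape)  # (16, 8, 8)
```

Zero padding keeps the spatial size identical across branches, which is what allows feature maps from different kernel sizes to be concatenated directly.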

The fused features are propagated to another multi-scale block. This block is made up of a plain convolutional layer and dilated convolutions with dilation rates of 2 and 4 respectively. It is also repeated 4 times, and an instance-wise concat operation is used to merge the feature maps. The network architecture used in this work is presented in Figure 1:
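
The second block can be sketched in the same spirit: a plain convolution next to two dilated ones, merged by concatenation. The 3×3 kernel size, the channel-axis concat and the names `dilated_conv2d` and `dilated_block` are assumptions made for this illustration; only the dilation rates come from the text.

```python
import numpy as np

def dilated_conv2d(x, w, dilation=1):
    """Stride-1, zero-padded dilated convolution.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k."""
    k = w.shape[2]
    pad = dilation * (k // 2)
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            r, c = i * dilation, j * dilation  # taps sit `dilation` pixels apart
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, r:r + h, c:c + wd])
    return out

def dilated_block(x, weights, rates=(1, 2, 4)):
    """Plain convolution (rate 1) plus dilated branches, concatenated on channels."""
    branches = [dilated_conv2d(x, w, d) for w, d in zip(weights, rates)]
    return np.concatenate(branches, axis=0)

# Impulse response: a single active pixel shows where a dilated kernel looks.
impulse = np.zeros((1, 11, 11))
impulse[0, 5, 5] = 1.0
resp = dilated_conv2d(impulse, np.ones((1, 1, 3, 3)), dilation=4)
rows = np.nonzero(resp[0].any(axis=1))[0]
print(np.count_nonzero(resp), rows.tolist())  # 9 [1, 5, 9]

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 16, 16))  # hypothetical fused features
ws = [rng.standard_normal((4, 3, 3, 3)) * 0.1 for _ in range(3)]
out = dilated_block(feats, ws)
```

The impulse response has only 9 non-zero entries, yet they are spread over a 9×9 window, which shows the wider context a dilated kernel sees without extra weights.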

Figure 1: Network architecture used in this work

Multi Scale Fusion

High-level neurons in a convolutional neural network have a larger receptive field. Although low-level neurons have a smaller receptive field, they carry more detailed information. Hence, for better results, we combine feature maps at different scales. We concatenate the high-level and intermediate-level feature maps using a concat operator. Skip connections also help the multi-scale fusion operation by creating an additional pathway for the flow of information.
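
A minimal sketch of such a fusion step is given below. It assumes that the high-level map has a lower spatial resolution and is brought to the size of the intermediate map by nearest neighbour upsampling before the concat; the text does not describe this step, and the names `upsample_nearest` and `fuse` are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(high, mid):
    """Concatenate a coarse high-level map with a finer intermediate-level map."""
    factor = mid.shape[1] // high.shape[1]
    up = upsample_nearest(high, factor)
    if up.shape[1:] != mid.shape[1:]:
        raise ValueError("spatial sizes must differ by an integer factor")
    return np.concatenate([mid, up], axis=0)

rng = np.random.default_rng(0)
high = rng.standard_normal((8, 4, 4))  # hypothetical high-level features
mid = rng.standard_normal((4, 8, 8))   # hypothetical intermediate-level features
fused = fuse(high, mid)
print(fused.shape)  # (12, 8, 8)
```

The fused tensor keeps the fine spatial grid of the intermediate features while carrying the wider context of the high-level ones in additional channels.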

Implementation Details

A state-of-the-art ResNet backbone pre-trained on the ImageNet dataset was used as the feature extractor. In all the experiments, the ADAM optimizer was used with a learning rate of 0.0001, a momentum of 0.9, a weight decay of 0.0004 and a batch size of 8. The network was trained using Stochastic Gradient Descent (SGD) for 500K iterations on the NYU Depth V2 dataset, 100K iterations on the Make3D dataset and 300K iterations on the Kitti dataset.
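
To make the quoted hyperparameters concrete, the sketch below performs one Adam update on a single scalar parameter. Several points are assumptions rather than details from the text: the momentum of 0.9 is read as Adam's first-moment coefficient, the second-moment coefficient and epsilon take commonly used default values, and weight decay is applied as an L2 term added to the gradient. The function name `adam_step` is hypothetical.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=4e-4):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    grad = grad + weight_decay * param        # L2 weight decay (assumed form)
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical first step: parameter 0.0, gradient 1.0, empty moment estimates.
p, m, v = adam_step(0.0, 1.0, 0.0, 0.0, t=1)
```

After bias correction the first step moves the parameter by almost exactly the learning rate, so `p` ends up very close to -0.0001 regardless of the gradient's magnitude.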


The model predictions on the NYU v2 dataset are compared with the ground truth depth maps in Figure 2:

Figure 2: Qualitative comparison of the estimated depth map on the NYU v2 dataset. Color indicates depth (red is far, blue is close). First row: RGB image, second row: Ground Truth depth map, third row: Results of our proposed method

The model prediction for test image number 1 of the Kitti dataset is compared with the ground truth depth map in Figure 3:

Figure 3: The output predictions of our network on test image number 1. First row: input image, second row: ground truth depth map, third row: model prediction depth map. Color indicates depth (red is far, blue is close).

The model prediction for test image number 5 of the Kitti dataset is compared with the ground truth depth map in Figure 4:

Figure 4: The output predictions of our network on test image number 5. First row: input image, second row: ground truth depth map, third row: model prediction depth map. Color indicates depth (red is far, blue is close). Our network fails to detect the person in front of the car as well as the person in the bottom left corner.


In this paper, we proposed a novel network architecture for monocular depth estimation using multi-scale feature fusion. We presented the network architecture, training details, loss functions and the evaluation metrics used. We used the Make3D, NYU Depth V2 and Kitti datasets for training and testing our network. Our network not only beats the previous state-of-the-art methods on monocular depth estimation but also has fewer parameters, making it feasible in a real-time setting.


  • I. Alhashim and P. Wonka (2018) High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017a) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT Press, Cambridge.
  • K. Karsch, C. Liu, and S. B. Kang (2014) Depth transfer: depth extraction from video using non-parametric sampling.
  • F. Liu, C. Shen, G. Lin, and I. Reid (2015a) Learning depth from single monocular images using deep convolutional neural fields.
  • A. Rajagopalan, S. Chaudhuri, and U. Mudenagudi (2004) Depth estimation and image restoration using defocused stereo pairs.
  • A. Saxena, M. Sun, and A. Y. Ng (2008) Make3d: learning 3d scene structure from a single still image.
