Understanding Perception and Motion Planning for Autonomous Driving (2021)





Depth estimation from a single RGB image. Image from the TRI-AD blog [7]

Adrien Gaidon of TRI-AD believes that supervised learning won’t scale, generalize, or last. That’s why he is looking for a way to scale supervision efficiently, without labeling. His team found a way: self-supervision.

Without any supervised labels, the TRI-AD AI team can reconstruct 3D point clouds from monocular images.

How is this possible? They use prior knowledge of projective geometry to produce the desired output with their new model, PackNet. They achieve very good results: their self-supervised model even outperforms supervised baselines on this task.

Self-supervised training does not require any ground-truth depth data. Instead, the network is trained to synthesize views of the scene, producing depth as an intermediate output.

Their approach also addresses a well-known bottleneck: a traditional conv-net loses the resolution of the input image as it passes through pooling layers. They therefore adapted the convolutional architecture to the depth-estimation task: the PackNet model preserves the resolution of the target image thanks to tensor reshaping and 3D convolutions.

Given a single image at test time, they aim to learn:

  • the depth function f_d that maps an image to a per-pixel depth map,
  • the monocular ego-motion estimator f_x, which predicts the rotation/translation from a source image to a target image.
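In signature terms, the two learned functions can be sketched like this (a toy sketch with dummy outputs; the names f_d and f_x follow the text, everything else is illustrative):

```python
import numpy as np

def f_d(image: np.ndarray) -> np.ndarray:
    """Depth network: maps an (H, W, 3) RGB image to an (H, W) depth map.
    Placeholder body: a real model is a conv-net; here we return a dummy map."""
    h, w, _ = image.shape
    return np.ones((h, w))  # one depth value per pixel

def f_x(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Ego-motion network: maps a (source, target) image pair to a 6-DoF
    vector (3 rotation + 3 translation parameters). Dummy output here."""
    return np.zeros(6)

image = np.random.rand(192, 640, 3)
assert f_d(image).shape == (192, 640)
assert f_x(image, image).shape == (6,)
```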

We’ll focus on the first learning objective: prediction of depth.

During training, the depth-estimation problem becomes an image-reconstruction problem, boiling down to the traditional computer-vision problem of Structure-from-Motion (SfM).

Structure-from-motion in traditional computer vision. Image from Humboldt State University.

The PackNet model learns this single-camera SfM with two main blocks.

Image from original paper [2].
  1. Packing block: the input RGB tensor is spatially reduced by an invertible Space2Depth operation [6]; the network then learns to compress and re-expand the result with 3D convolutional layers. The output is flattened by reshaping and fed to 2D convolutional layers.
  2. Unpacking block: learns to decompress and unfold the packed convolutional features back to a higher resolution. It again uses 2D then 3D convolutional layers, and the result is reshaped and expanded by the inverse Depth2Space operation [6].
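The Space2Depth/Depth2Space pair is easy to make concrete. A minimal NumPy sketch (illustrative, not the paper’s implementation) shows why the operation is invertible and loses no information:

```python
import numpy as np

def space_to_depth(x, r):
    """Rearrange an (H, W, C) tensor into (H/r, W/r, C*r*r): each r x r
    spatial block becomes r*r extra channels, losing no information."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

def depth_to_space(x, r):
    """Exact inverse of space_to_depth."""
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c)

x = np.random.rand(8, 8, 3)
packed = space_to_depth(x, 2)   # (4, 4, 12): resolution traded for channels
assert packed.shape == (4, 4, 12)
assert np.allclose(depth_to_space(packed, 2), x)  # round-trip is lossless
```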

During training, the network learns to generate an image Î_t by sampling pixels from source images.


Their loss for depth mapping is divided into two components:

Loss for depth mapping. Image by the author.

Appearance matching loss L_p: evaluates the pixel similarity between the target image I_t and the synthesized image Î_t using a Structural Similarity (SSIM) term and an L1 loss term.
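A minimal sketch of such an appearance-matching loss, assuming the common SSIM + L1 mix with weight alpha = 0.85 (the SSIM here is a simplified whole-image version; real implementations use a sliding window):

```python
import numpy as np

def ssim_simplified(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (whole-image) SSIM; a simplification for illustration."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_loss(target, synthesized, alpha=0.85):
    """L_p = alpha * (1 - SSIM)/2 + (1 - alpha) * L1."""
    ssim_term = (1.0 - ssim_simplified(target, synthesized)) / 2.0
    l1_term = np.abs(target - synthesized).mean()
    return alpha * ssim_term + (1 - alpha) * l1_term

img = np.random.rand(8, 8)
assert photometric_loss(img, img) < 1e-9  # identical images -> ~zero loss
```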

Depth regularization loss L_s: encourages the estimated depth map to be locally smooth with an L1 penalty on its gradients. Since depth discontinuities legitimately occur at object edges, the penalty is weighted to be lower where the image gradient is high, so the smoothing concentrates on textureless, low-gradient regions.
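An edge-aware smoothness term of this kind can be sketched as follows (illustrative NumPy, not the paper’s code):

```python
import numpy as np

def smoothness_loss(depth, image):
    """Edge-aware smoothness: penalize depth gradients, down-weighted
    (via exp of the negative image gradient) where the image gradient is
    large, i.e. where a real object edge is likely."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    i_dx = np.abs(np.diff(image, axis=1))
    i_dy = np.abs(np.diff(image, axis=0))
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()

depth = np.linspace(0, 1, 64).reshape(8, 8)  # smoothly varying depth
flat_image = np.zeros((8, 8))                # textureless: full penalty
edgy_image = depth * 100                     # strong image edges: damped penalty
assert smoothness_loss(depth, edgy_image) < smoothness_loss(depth, flat_image)
```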

We don’t cover it here, but their loss also leverages the camera velocity, when available, to resolve the inherent scale ambiguity of monocular vision.


Comparison of image reconstruction with the traditional pipeline (b) and with the proposed method (c). The proposed method preserves details. Image from original paper [2].

Their model outperforms self-, semi-, and fully-supervised methods on the well-known KITTI benchmark.

2021 Updates:

They recently extended their approach to a 360-degree camera configuration with their new 2021 model: “Full Surround Monodepth from Multiple Cameras”, Vitor Guizilini et al.

To do that, they use spatio-temporal information very cunningly. The six cameras they use overlap too little to reconstruct the image of one camera (camera A) in the frame of another camera (camera B). To solve this, they project the image from a past frame of camera A into the current frame of camera B.

The FSM model predicts a point cloud (right) from a 6-camera rig configuration (left). Image from original paper [2].

Semi-supervised = self-supervision + sparse data.

This year, TRI-AD also presented a semi-supervised inference network: “Sparse Auxiliary Networks (SANs) for Unified Monocular Depth Prediction and Completion”, Vitor Guizilini et al. These SANs can perform both depth prediction and depth completion, depending on whether only an RGB image or also a sparse point cloud is available at inference time.

Section 2) Using LiDAR and HD Maps

Self-driving cars have traditionally used LiDAR, a laser sensor, together with High Definition (HD) maps to predict and plan their motion. In recent years, multi-task deep learning has produced end-to-end models for navigating with LiDAR.

Why use HD maps? HD maps contain information about the semantic scene (lanes, stop-sign locations, etc.).

The overall architecture proposed in the “Perceive, Predict, Plan” paper. Image from paper [3].

I’ll present a paper published by Uber ATG at ECCV 2020: “Perceive, Predict, and Plan” [3]. It was presented by one of its authors, Raquel Urtasun, who this year founded her own AD startup, Waabi.

Her paper proposes an end-to-end model that jointly perceives, predicts, and plans the motion of the car. The authors created a new intermediate representation to learn their objective function: a semantic occupancy grid used to evaluate the cost of each candidate trajectory in the motion-planning process. This semantic layer also serves as an intermediate, interpretable result. The grid makes the AD system safer than conventional approaches because it does not rely on a detection threshold and can represent objects of any shape.

Their model is divided into three blocks.

  1. Perception block
Inference of the end-to-end perception and recurrent occupancy model. || is concatenation, + element-wise summation, and δ bilinear interpolation for downscaling. Image from paper [3].

The perception model first extracts features independently from the LiDAR measurements and the HD maps. To do that, they voxelize 10 successive LiDAR sweeps (T = 10 frames) and transform them into the present car frame in BEV (bird’s-eye view). This creates a discrete 3D grid with a binary value for each cell: occupied or empty.

They then fold the time axis into the Z (height) axis to obtain an H x W x ZT tensor. Concatenating over this third axis makes it possible to use a 2D-convolution backbone network later.

The map information is stored in an M-channel tensor, each channel containing a distinct map element (road, lane, stop sign, etc.). The final input tensor is therefore H x W x (ZT + M); in practice, ZT + M = 17 binary channels.
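The shape bookkeeping above can be sketched in NumPy (the grid dimensions and the Z/T/M split below are illustrative; only ZT + M = 17 comes from the paper):

```python
import numpy as np

# Illustrative dimensions (one possible split, not the paper's exact values)
H, W, Z, T, M = 100, 100, 1, 10, 7   # so Z*T + M = 17 binary channels

# 10 voxelized LiDAR sweeps: binary occupancy, shape (T, H, W, Z)
lidar_sweeps = (np.random.rand(T, H, W, Z) > 0.95).astype(np.float32)

# Fold time and height into the channel axis -> (H, W, Z*T)
lidar = lidar_sweeps.transpose(1, 2, 3, 0).reshape(H, W, Z * T)

# M binary map channels (road, lane, stop sign, ...)
hd_map = (np.random.rand(H, W, M) > 0.5).astype(np.float32)

# Final input: (H, W, Z*T + M), ready for a 2D-convolution backbone
net_input = np.concatenate([lidar, hd_map], axis=-1)
assert net_input.shape == (H, W, Z * T + M) == (100, 100, 17)
```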

The input tensor is then fed into a backbone network made of two streams, one for the LiDAR features and one for the map features. The streams differ only in the number of features used (more for the LiDAR stream). The outputs are concatenated and fed into a last block of convolutional layers to output a 256-dimensional feature map.

2. Prediction block

The semantic classes for prediction are organized into a hierarchy of groups, each with a different planning cost (a parked vehicle matters less than a moving one). Each group is represented as a collection of categorical random variables over space and time (0.4 m/pixel for the x, y grid and 0.5 s in time, so 10 steps cover a 5 s window). In a nutshell, the prediction answers the question: who (which instance of which class) is going to move where?
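The discretization numbers above imply the following grid bookkeeping (the number of class groups and the grid size here are illustrative assumptions):

```python
import numpy as np

CELL = 0.4    # metres per pixel in the x, y grid
DT = 0.5      # seconds per time step
STEPS = 10    # 10 steps of 0.5 s -> a 5 s prediction window
assert STEPS * DT == 5.0

# One categorical occupancy distribution per class group, cell and step:
# shape (groups, STEPS, H, W), probabilities in [0, 1]
groups, H, W = 4, 100, 100            # illustrative numbers
occupancy = np.random.rand(groups, STEPS, H, W)
assert occupancy.shape == (4, 10, 100, 100)
# a 100 x 100 grid at 0.4 m/pixel covers a 40 m x 40 m area
assert H * CELL == 40.0
```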

They again mainly use 2D convolution blocks, with two parallel streams that have different dilation rates. One stream, with fine-grained features, targets predictions in the near future; the other uses coarser features with dilated convolutions for long-term prediction. The output is updated in a recurrent fashion from the previous output and the concatenated features.

3. Motion Planning block

They sample a diverse set of trajectories for the ego-car and pick the one that minimizes a learned cost function. This cost function is the sum of two terms: f_o, which mainly takes into account the semantic occupancy forecast, and f_r, related to comfort, safety, and traffic rules.

f_o is composed of two terms: the first penalizes trajectories that intersect regions with high occupancy probability; the second penalizes high-velocity motion in areas with uncertain occupancy.

f is the cost function, tau the trajectory, x the input data, o the occupancy prediction, and w the learnable parameters. [3]
t is the time step, c a cell of the occupancy grid, tau a trajectory, lambda a margin parameter added around surrounding objects to avoid collisions, and v the velocity. [3]
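The trajectory-selection idea behind f_o can be sketched as a toy cost over grid cells (shapes, threshold, and discretization are all illustrative assumptions, not the paper’s values):

```python
import numpy as np

def occupancy_cost(traj_cells, velocities, occupancy, uncertain_thresh=0.3):
    """Toy version of f_o: for each (t, cell) visited by a trajectory, add
    the predicted occupancy probability (collision term), plus a velocity
    penalty in cells whose occupancy is non-negligible (speed term)."""
    cost = 0.0
    for t, (i, j) in enumerate(traj_cells):
        p = occupancy[t, i, j]
        cost += p                          # collision term
        if p > uncertain_thresh:
            cost += velocities[t] * p      # speed penalty in risky cells
    return cost

T = 3
occupancy = np.zeros((T, 10, 10))
occupancy[:, 5, 5] = 0.9                   # one likely-occupied cell
safe = [(0, 0), (1, 1), (2, 2)]            # trajectory avoiding it
risky = [(5, 5), (5, 5), (5, 5)]           # trajectory driving through it
v = np.array([5.0, 5.0, 5.0])
# the planner keeps the sampled trajectory with the lower cost
assert occupancy_cost(safe, v, occupancy) < occupancy_cost(risky, v, occupancy)
```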

These cost functions are used in the final multi-task objective function:

Summation of semantic occupancy loss and planning loss. [3]

Semantic occupancy loss L_s: a cross-entropy loss between the ground-truth distribution p and the predicted distribution q of the semantic occupancy random variables.

Planning loss L_M: a max-margin loss that encourages the human driving trajectory (the ground truth) to have a smaller cost than the other sampled trajectories.
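A toy version of such a max-margin planning loss (the margin value and function name are illustrative):

```python
import numpy as np

def planning_loss(human_cost, sampled_costs, margin=1.0):
    """Toy max-margin loss: the human (ground-truth) trajectory should cost
    at least `margin` less than every other sampled trajectory; violating
    trajectories contribute linearly, satisfied ones contribute zero."""
    sampled_costs = np.asarray(sampled_costs)
    return np.maximum(0.0, human_cost - sampled_costs + margin).sum()

# Human trajectory already cheaper by more than the margin -> zero loss
assert planning_loss(1.0, [5.0, 6.0]) == 0.0
# One trajectory violates the margin -> it alone contributes
assert abs(planning_loss(5.0, [5.2, 9.0]) - 0.8) < 1e-9
```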


Results in a 5s frame. Image from the paper [3]

As a result, adding the occupancy-grid representation makes the model outperform state-of-the-art methods on the number of collisions. Training the whole pipeline end-to-end (rather than one block after another) improves safety (by 10%) and human imitation (by 5%).

