Drivable Space in Autonomous Driving
Recent trends in academic research on drivable space as of 2023
Drivable space, or Free Space, plays a safety-critical role in autonomous driving. In a previous blog post, we reviewed the definition and importance of this often-overlooked perception feature. In this article, we will review recent trends in academic research.
Drivable space detection algorithms can be characterized along two dimensions: input and output. Regarding the input sensor modality, methods can be categorized as vision-based, LiDAR-based, or vision-LiDAR fusion. Regarding the output spatial representation, they can be categorized into 2D perspective image space, 3D space, and Bird’s Eye View (BEV) space.
Visual images are intrinsically 2D, and LiDAR point cloud measurements are intrinsically 3D. As discussed in a previous blog post, BEV space is essentially a simplified or degenerate 3D space, and we will use BEV space and 3D space interchangeably throughout this blog. In essence, we have a 2×2 input-output matrix for evaluating all drivable space algorithms, as shown in the image below. The top right quadrant in green is the North Star: it has the best representation power while being the most cost-effective. In the following sections, we will discuss algorithms that fall into three categories: 2D-to-2D, 3D-to-3D, and 2D-to-3D.
It is worth noting that there is currently no universally recognized standard for expressing and evaluating the accuracy of drivable space in autonomous driving. In this post, we will review related tasks, which can take a variety of formulations. We also provide some insights on the future direction of the academic community, with the aim of accelerating research on this crucial task in autonomous driving.
2D-to-2D Methods (with Images)
The detection of drivable space in perspective 2D image space is essentially the task of image segmentation. There are two main approaches: one is stixel-based obstacle detection, and the other is drivable space semantic segmentation.
The stixel representation
The stixel (a portmanteau of stick and pixel) approach assumes that the area corresponding to the bottom pixel of each image column is drivable from the ego vehicle's perspective. It then grows a stick from that pixel toward the top of the image until it encounters an obstacle, yielding the drivable extent of that column. One of the most representative works in this line is the StixelNet series. Stixels abstract general obstacles on the ground into sticks and divide the image space into drivable space and obstacles. The stixel representation strikes a good balance between pixels and objects, achieving good accuracy and efficiency.
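The column-wise growing logic can be illustrated with a toy sketch (not the StixelNet network itself, which learns to predict the obstacle position per column; the function name and mask layout here are my own):

```python
def stixel_free_extent(obstacle_mask):
    """For each image column, grow a 'stick' from the bottom row upward
    and stop at the first obstacle pixel. Returns, per column, the row
    index of the first obstacle hit (or -1 if the whole column is free).
    obstacle_mask: list of rows (top row first), 1 = obstacle, 0 = free.
    """
    height = len(obstacle_mask)
    width = len(obstacle_mask[0])
    extents = []
    for col in range(width):
        hit = -1
        for row in range(height - 1, -1, -1):  # bottom-up scan
            if obstacle_mask[row][col] == 1:
                hit = row
                break
        extents.append(hit)
    return extents

# Tiny 4x3 mask: obstacles in columns 0 and 1, column 2 fully free.
mask = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
print(stixel_free_extent(mask))  # -> [1, 1, -1]
```

Everything below the returned row index in each column is the drivable stick; only that single boundary pixel per column needs to be lifted to 3D.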
Deep learning has made rapid progress in recent years, allowing drivable space detection to be directly modeled as a semantic segmentation problem using convolutional neural networks (CNNs). This approach differs from the stixel representation in that it directly classifies whether each pixel of the 2D image is drivable. Typical works include DS-Generator and RPP.
Compared to the stixel approach, general semantic segmentation is more flexible. However, it requires more complex post-processing to be useful for downstream components; for example, the semantic prediction may not be contiguous (not as clean as the result shown in the following illustration). In the stixel approach, only one pixel per column needs to be selected and converted to 3D information.
Lifting to 3D
Since the downstream tasks of prediction and planning operate in 3D space, it is necessary to convert the drivable space results obtained on 2D images to 3D or the degenerate BEV space. Common 2D-to-3D post-processing techniques include Inverse Perspective Mapping (IPM), monocular/stereo depth estimation, and direct 3D physical measurements such as LiDAR point clouds. In addition, a 2D-to-2D algorithm processes each camera stream separately and thus needs explicit rules to stitch the results together for 360° perception in a multi-camera setup.
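As an illustration of the IPM idea, the sketch below projects a pixel onto a flat ground plane for an ideal pinhole camera with zero pitch and roll; the function name and parameters are my own, and real implementations handle full camera extrinsics and lens distortion.

```python
def ipm_pixel_to_ground(u, v, fx, fy, cx, cy, cam_height):
    """Project a pixel onto the flat ground plane (Inverse Perspective Mapping).
    Assumes a pinhole camera with optical axis parallel to a flat ground,
    mounted cam_height meters above it. Camera frame: x right, y down, z forward.
    Returns (forward, lateral) in meters, or None for pixels at/above the horizon.
    """
    dx = (u - cx) / fx      # ray direction, x component (right)
    dy = (v - cy) / fy      # ray direction, y component (down)
    if dy <= 0:             # ray never intersects the ground plane
        return None
    t = cam_height / dy     # scale at which the ray reaches y = cam_height
    return (t, t * dx)      # (forward distance along z, lateral offset along x)

# A pixel 100 px below the principal point maps to a point straight ahead.
fx = fy = 1000.0
cx, cy = 640.0, 360.0
print(ipm_pixel_to_ground(640.0, 460.0, fx, fy, cx, cy, 1.5))  # roughly (15.0, 0.0)
```

The example also makes the weakness concrete: as `dy` shrinks near the horizon, `t` explodes, which is why IPM position error grows quickly with distance.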
The tedious post-processing in both 2D perspective space and the 2D-to-3D conversion is typically handcrafted, and this brittle logic is susceptible to corner cases. In practice, 2D-to-2D algorithms are rarely used in autonomous driving, except in low-speed scenarios such as parking.
3D-to-3D Methods (with Lidar)
These 3D drivable space algorithms take in LiDAR point clouds and generate 3D drivable space directly. While early studies were mostly based on LiDAR, we have recently (in early 2023) seen an explosion of vision-based semantic 3D occupancy prediction, which we will explore in the next section.
Lidar Ground segmentation
Ground segmentation-based methods aim to divide LiDAR point cloud data into ground and non-ground parts. These methods can be categorized into geometric rule-based algorithms and deep learning-based algorithms. Even before the widespread adoption of deep learning, algorithms based on geometric rules were widely used on LiDAR point clouds for ground detection (or the complementary task of curb detection) and general obstacle detection. These efforts typically rely on plane fitting and region growing algorithms, popularized by the 2007 DARPA Urban Challenge. However, simple ground plane assumptions fail on uneven ground, potholes, and uphill-and-downhill scenes. To account for locally uneven roads while preserving overall smoothness, several studies proposed optimizations based on Gaussian processes.
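A minimal sketch of the plane-fitting idea behind the geometric-rule methods, using RANSAC on a synthetic point cloud (names and thresholds are my own; production systems add region growing, normal filtering, and the Gaussian-process refinements mentioned above):

```python
import random

def ransac_ground_plane(points, dist_thresh=0.2, iters=200, seed=0):
    """Fit a ground plane to a point cloud with RANSAC.
    points: list of (x, y, z). Returns ((a, b, c, d) with a*x + b*y + c*z + d = 0
    and unit normal (a, b, c), plus the list of inlier indices).
    """
    best_plane, best_inliers = None, []
    rng = random.Random(seed)
    for _ in range(iters):
        p1, p2, p3 = rng.sample(points, 3)
        # Plane normal = (p2 - p1) x (p3 - p1)
        u = [p2[i] - p1[i] for i in range(3)]
        v = [p3[i] - p1[i] for i in range(3)]
        n = [u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0]]
        norm = (n[0] ** 2 + n[1] ** 2 + n[2] ** 2) ** 0.5
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = [c / norm for c in n]
        d = -(n[0] * p1[0] + n[1] * p1[1] + n[2] * p1[2])
        inliers = [i for i, p in enumerate(points)
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) < dist_thresh]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = (n[0], n[1], n[2], d), inliers
    return best_plane, best_inliers

# Synthetic scene: a 10x10 grid of flat ground points at z = 0 plus one obstacle.
ground = [(x * 0.5, y * 0.5, 0.0) for x in range(10) for y in range(10)]
cloud = ground + [(2.0, 2.0, 1.8)]
plane, inliers = ransac_ground_plane(cloud)
print(len(inliers))  # expect 100: all ground points, obstacle excluded
```

The points *not* in `inliers` are the general obstacles, regardless of their class, which is exactly why this rule-based pipeline handles unknown objects but breaks on non-planar roads.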
Deep learning-based approaches are gaining popularity with the availability of more computational resources and large-scale datasets. The task of ground segmentation can be formulated as a general semantic segmentation of the lidar point cloud. LidarMTL is one typical work that proposes a multitask model with six tasks, including road structure understanding added on top of dynamic obstacle detection. For road scene understanding, two semantic segmentation tasks of drivable space and ground are designed, along with a ground height regression task. Interestingly, auxiliary tasks, such as foreground segmentation and intra-object part location, have also been shown to benefit dynamic object detection.
Freespace Forecaster and its differentiable version use a freespace-centric representation to predict drivable space for motion planning. These methods cast rays from the ego vehicle in polar coordinates to find, for each angle bin, the ground and obstacle points (or grid cells) and thereby the reachable range. The representation is quite similar to the stixel representation in 2D perspective space, in the sense that the closest obstacle is located per bin (per image column for stixels, per polar angle bin in BEV).
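The per-bin closest-obstacle representation can be sketched as follows (an illustrative simplification, not the Freespace Forecaster model itself, which learns to forecast these ranges from past sweeps; names are my own):

```python
import math

def polar_freespace(obstacle_points, num_bins=8, max_range=50.0):
    """Ray-casting-style freespace: for each polar angle bin around the ego
    vehicle, keep the range to the closest obstacle (max_range if the bin is
    empty). Analogous to a stixel, but per BEV angle bin instead of per image
    column. obstacle_points: list of (x, y) positions in the ego frame.
    """
    ranges = [max_range] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for x, y in obstacle_points:
        angle = math.atan2(y, x) % (2.0 * math.pi)
        b = min(int(angle / bin_width), num_bins - 1)
        ranges[b] = min(ranges[b], math.hypot(x, y))
    return ranges

# Two obstacles nearly straight ahead (bin 0) and one to the left (bin 2 of 8).
ranges = polar_freespace([(10.0, 0.1), (5.0, 0.2), (0.0, 7.0)])
print(ranges)
```

Everything inside the polyline formed by the per-bin ranges is freespace, which a planner can consume directly.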
Occupancy Grid and Scene Flow Representation
Among LiDAR-based algorithms, the most general approach uses Occupancy Grid and Scene Flow to express the position and movement of general obstacles, respectively. Two representative papers are MotionNet and PointMotionNet.
Dynamic object detection and tracking tasks are prevalent in public datasets, enabling academic researchers to generate ground truth for occupancy grid and scene flow. Occupancy representation, in theory, allows for the detection of general obstacles, covering unknown objects with arbitrary shapes, as well as static obstacles and road structures. However, in practice, quantifying the effectiveness of the algorithm is challenging, as the majority of objects in public datasets are regular. A dataset and benchmark for the detection of general obstacles are necessary to further advance the research in this field.
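For concreteness, the occupancy-grid representation itself (the target being predicted, not the learned network of MotionNet or PointMotionNet) can be built from a point cloud roughly like this; the function name, cell size, and the crude height-threshold stand-in for ground segmentation are my own:

```python
def bev_occupancy_grid(points, grid_size=10, cell=1.0, origin=(-5.0, -5.0), min_height=0.3):
    """Rasterize non-ground points into a BEV occupancy grid.
    points: (x, y, z) in the ego frame; points below min_height are treated
    as ground returns and ignored (a crude stand-in for ground segmentation).
    Returns a grid_size x grid_size list of 0/1 cells (row index = x, col = y).
    """
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y, z in points:
        if z < min_height:
            continue  # skip ground-level returns
        i = int((x - origin[0]) / cell)
        j = int((y - origin[1]) / cell)
        if 0 <= i < grid_size and 0 <= j < grid_size:
            grid[i][j] = 1
    return grid

# One obstacle at (2.5, -1.5) above ground, plus one ground return that is ignored.
g = bev_occupancy_grid([(2.5, -1.5, 1.0), (2.5, -1.5, 0.05)])
print(sum(sum(row) for row in g))  # exactly one occupied cell
```

Note that nothing in this representation cares what class the obstacle is, which is the source of its generality.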
At first glance, it may seem counterintuitive that an object detection algorithm can generate ground truth to power the even more flexible algorithms of occupancy prediction. However, there are two important aspects to consider. First, object detection assists with the heavy lifting and bootstrapping during ground-truth annotation, while human quality assurance and minor adjustments are still necessary in a real production environment. Second, algorithm formulation plays a crucial role: the formulation of occupancy prediction makes it more flexible and capable of learning subtleties that object detection may have missed.
The lidar-based occupancy algorithms are not to be confused with the Occupancy Network proposed by Tesla. The Occupancy Network idea is a vision-centric algorithm.
2D-to-3D Method (BEV perception and more)
The multi-camera BEV perception framework (see my previous blog post on BEV perception) raises visual 3D perception performance to a new level by simplifying multi-camera post-processing and post-fusion. Additionally, the framework unifies camera and LiDAR algorithms in the same representation space, providing a convenient framework for sensor fusion, both early and late.
Detecting the physical edge of the road in BEV space is a subset of the drivable space detection task. While road edges and lane lines can both be expressed as vector lines, road boundaries lack constraints such as parallelism, fixed lane width, and convergence at vanishing points, making their shapes and positions more flexible. These freer and more diverse road edges can be modeled in several ways:
- Heatmap-based: a heatmap is produced by a semantic segmentation-like decoder. The heatmap needs to be processed to vector elements to be consumed by downstream components.
- Voxel-based: an extension of heatmap-based method. The 2D BEV grid in a heatmap is extended to 3D voxels.
- Vector-based: vectorized output is produced, based on primitive geometric elements such as polylines and polygons. These outputs can be passed for downstream consumption directly.
The semantic segmentation-based methods can be divided into two categories based on the modeling target: road boundary semantic segmentation and road layout semantic segmentation. The former, such as HDMapNet, can predict lane lines while also outputting road edges. The neural network output is a heatmap that requires binarization and clustering to produce vector outputs usable by downstream prediction and planning.
Other methods are based on semantic segmentation of the road structure itself, such as PETRv2, CVT, and Monolayout. The output is the road body, and its edge is the road boundary. The network output is still a heatmap that needs to be binarized; an edge-extraction operation then yields the vectorized road boundary. If downstream planning can directly consume the road structure itself, for example occupancy-grid-based planning, this perception method is more direct. However, that topic is more related to planning and control, so I will not expand on it here.
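The binarize-then-take-edges post-processing can be sketched on a BEV grid as follows (an illustrative toy; real pipelines additionally cluster and vectorize the resulting cells, and the function name is my own):

```python
def road_boundary_cells(road_mask):
    """Extract road-boundary cells from a binarized BEV road-body mask:
    a cell is a boundary cell if it is road and any 4-neighbor is non-road
    (or outside the grid). road_mask: list of rows of 0/1.
    Returns a list of (row, col) boundary cells.
    """
    h, w = len(road_mask), len(road_mask[0])
    boundary = []
    for r in range(h):
        for c in range(w):
            if not road_mask[r][c]:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(not (0 <= rr < h and 0 <= cc < w) or not road_mask[rr][cc]
                   for rr, cc in neighbors):
                boundary.append((r, c))
    return boundary

mask = [[1] * 4 for _ in range(4)]  # a solid 4x4 road blob
edges = road_boundary_cells(mask)
print(len(edges))  # 12: the outer ring; the 2x2 interior is not boundary
```

This also shows why the heatmap route is brittle: a single noisy cell in the mask immediately produces spurious boundary cells that downstream logic has to filter out.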
Voxel decoder (semantic occupancy prediction)
The idea of vision-only occupancy prediction has seen explosive growth in the early months of 2023, following the proposal of the Occupancy Network by Tesla in 2022. This voxel output representation can be seen as an extension of the heatmap representation, with one additional dimension of height predicted for each heatmap grid location.
One notable work in this area is SurroundOcc. It first designed an automatic pipeline to generate dense occupancy ground truth from sparse point clouds and then used this dense ground truth to supervise the learning of a dense occupancy grid from multi-camera image streams. I will write another blog on this topic to discuss the background, approaches, and caveats of this line of work.
Vector decoder (vectorized map prediction)
Both the heatmap-based and voxel-based methods are rooted in semantic segmentation and require heavy post-processing for downstream consumption. In comparison, methods that produce direct vector output are more straightforward. Representative methods include STSU, MapTR, and VectorMapNet, which directly output vectorized road edges. They can be considered variants of anchor-based object detection, where the basic geometric elements are polylines or polygons with 2N degrees of freedom, N being the number of points on the polyline or polygon. It is worth mentioning that STSU uses Bezier curves with 3 control points, which is novel but not as flexible as polylines and polygons, and currently performs worse.
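Vectorized map outputs such as these polylines are commonly scored against ground truth with a Chamfer distance between sampled points (as in the HDMapNet and VectorMapNet evaluation protocols); here is a minimal sketch, with function and variable names of my own choosing:

```python
def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two point sets sampled from
    polylines: mean nearest-neighbor distance from A to B plus from B to A.
    Commonly used to score vectorized map elements against ground truth.
    """
    def one_way(src, dst):
        total = 0.0
        for x, y in src:
            total += min(((x - u) ** 2 + (y - v) ** 2) ** 0.5 for u, v in dst)
        return total / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)

# A predicted boundary offset laterally by 1 m from the ground truth.
gt = [(float(x), 0.0) for x in range(5)]
pred = [(float(x), 1.0) for x in range(5)]
print(chamfer_distance(gt, pred))  # 2.0 (1.0 in each direction)
```

In practice the polylines are resampled at a fixed spacing before computing the distance, so the metric does not depend on how many control points each method predicts.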
Camera and Lidar fusion
The above 2D-to-3D methods leverage the rich texture and semantic information of camera images to reason about the 3D environment around the ego vehicle. While LiDAR point cloud data may lack these rich semantics, it provides accurate 3D position measurement information.
Multi-camera BEV space and LiDAR BEV space can be easily unified through BEV fusion. For example, HDMapNet compares a LiDAR-only method and a camera-LiDAR fusion method against the camera-only baseline. Even though cameras are slightly worse than LiDAR at BEV positioning, the multi-camera IoU metrics are still better for lane dividers and pedestrian crossings, while LiDAR is better at detecting road boundaries. This is understandable, as road boundaries are typically accompanied by height changes and are easier to detect with active LiDAR measurements.
Currently, there is no widely accepted dataset for drivable space in autonomous driving. However, there are related tasks, such as point cloud segmentation for the 3D-to-3D scheme and HD map prediction for the 2D-to-3D scheme. Unfortunately, the cost of producing 3D point cloud segmentation and HD maps is high, which limits academic research to a few public datasets, including NuScenes, Waymo, KITTI360, and Lyft. These datasets offer 3D point cloud segmentation and labeling with some road information, such as road surface, roadside, and crosswalk. The Lyft dataset also includes map information of the road area, which helps in understanding the road layout.
A public dataset and evaluation metric benchmark needs to be established to further the development in this field. Two things need special attention.
- Long-tail corner cases. It is essential to pay attention to detection performance on difficult samples to ensure the system’s reliability. However, long-tail data differ across regions and scenes, and collecting and labeling them requires time and technical expertise. It is worth exploring ways to balance small-sample data and improve the learning outcome.
- Output format. The definition and implementation of 3D drivable space are closely tied to the downstream consumption logic and the design of the autonomous driving system. It is challenging to establish a unified standard in the industry, and drivable space is rarely used as a clearly and independently defined module in academic research or public dataset competitions. Polygons may be a flexible enough format: detected polygons can be evaluated with an object detection score per frame plus a shape-consistency score across time, as downstream components require consistent shapes for precise vehicle maneuvers.
- There is not yet a universally accepted definition or evaluation metric for drivable space in academia. Relabeling public datasets is one possible path, but long-tail corner cases need to be enriched.
- Pixel-level drivable space detection in 2D image space relies on depth information from IPM or other modules, but position error grows with distance. In 3D space, LiDAR offers high geometric accuracy but weak semantic classification. It can detect road edges and general obstacles via ground fitting and related methods, and can also identify unknown dynamic and static obstacles through ray-casting freespace or occupancy representations.
- 2D-to-3D BEV perception methods are promising due to their cost-effectiveness and powerful representation capability. However, the lack of a standard for drivable space has resulted in various output formats. The output format depends on the requirements from downstream planning and control.
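As a sketch of the polygon-based evaluation suggested above, temporal shape consistency can be approximated by rasterizing the drivable-space polygons of consecutive frames and computing their IoU (an illustrative toy implementation with an even-odd point-in-polygon test; names and grid resolution are my own):

```python
def rasterized_iou(poly_a, poly_b, grid=50, extent=10.0):
    """Approximate IoU between two polygons by rasterizing them on a
    grid x grid lattice covering [0, extent) x [0, extent), testing cell
    centers with the even-odd rule. Comparing the drivable-space polygon
    of frame t against frame t+1 gives a simple temporal consistency score.
    """
    def inside(px, py, poly):
        hit = False
        n = len(poly)
        for i in range(n):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            if (y1 > py) != (y2 > py):  # edge straddles the horizontal ray
                x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < x_cross:
                    hit = not hit
        return hit

    inter = union = 0
    step = extent / grid
    for i in range(grid):
        for j in range(grid):
            px, py = (i + 0.5) * step, (j + 0.5) * step
            a, b = inside(px, py, poly_a), inside(px, py, poly_b)
            inter += a and b
            union += a or b
    return inter / union if union else 1.0

# Two axis-aligned squares overlapping by half of their area.
sq1 = [(1.0, 1.0), (5.0, 1.0), (5.0, 5.0), (1.0, 5.0)]
sq2 = [(3.0, 1.0), (7.0, 1.0), (7.0, 5.0), (3.0, 5.0)]
print(round(rasterized_iou(sq1, sq2), 3))  # about 0.333
```

For a real benchmark, the second polygon would first be motion-compensated into the first frame's coordinates using ego pose, so that only genuine shape flicker is penalized.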
This is the second post on drivable space, focusing on recent academic progress. The first post explored the concept of drivable space. In our next post, we will discuss the current industrial applications of drivable space, including how it can be extended to general robotics beyond autonomous driving.
Note: All images in this blog post are either created by the author, or from academic papers publicly available. See captions for details.
- StixelNet: A Deep Convolutional Network for Obstacle Detection and Road Segmentation, BMVC 2015
- Real-Time Category-Based and General Obstacle Detection for Autonomous Driving, ICCV 2018
- DS-Generator: Learning Collision-Free Space Detection from Stereo Images, IEEE/ASME Transactions on Mechatronics 2022
- RPP: Segmentation of Drivable Road Using Deep Fully Convolutional Residual Network with Pyramid Pooling, Cognitive Computation 2018
- STSU: Structured Bird’s-Eye-View Traffic Scene Understanding from Onboard Images, ICCV 2021
- HDMapNet: An Online HD Map Construction and Evaluation Framework, arXiv 2021
- PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images, arXiv 2022
- CVT: Cross-view Transformers for Real-time Map-view Semantic Segmentation, CVPR 2022
- Monolayout: Amodal Scene Layout from a Single Image, WACV 2020
- DARPA Urban Challenge 2007, ATZ worldwide 2008
- The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, Springer 2009
- Gaussian-Process-Based Real-Time Ground Segmentation for Autonomous Land Vehicles, Journal of Intelligent & Robotic Systems 2014
- A Gaussian Process-Based Ground Segmentation for Sloped Terrains, ICRoM 2021
- LidarMTL: A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding, arXiv 2021
- Freespace Forecaster: Safe Local Motion Planning with Self-supervised Freespace Forecasting, CVPR 2021
- MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps, CVPR 2020
- PointMotionNet: Point-Wise Motion Learning for Large-Scale LiDAR Point Clouds Sequences, CVPR 2022
- MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction, ICLR 2023
- VectorMapNet: End-to-end Vectorized HD Map Learning, arXiv 2022
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving, arXiv 2023