4D-Net boosts autonomous driving vision capabilities by fusing point cloud, camera, and time data
Designed to address the problem of accurately detecting objects — like other vehicles and pedestrians — at distance, Google and Waymo's 4D-Net offers a novel and generalizable approach to sensor fusion with some impressive results.
The key to safe, reliable autonomous vehicles, above even the intelligence of the self-driving system at their heart, is likely to be found in how efficiently they can process sensor data. Just as with visual acuity tests for human drivers, it’s vital to know that an autonomous vehicle can spot hazards and react accordingly, no matter how small or distant the problem may be.
Traditional two-dimensional camera systems and three-dimensional sensors, like LiDAR (Light Detection and Ranging), may not go far enough for full reliability and safety, leading a team at Google and Alphabet’s autonomous vehicle subsidiary Waymo to look into the fourth dimension: 4D-Net, an approach to object detection which fuses two- and three-dimensional data with the fourth dimension, time, with the claim of significantly improved performance.
“This is the first attempt to effectively combine both types of sensors, 3D LiDAR point clouds and onboard camera RGB images, when both are in time,” claim Google Research scientists and paper co-authors AJ Piergiovanni and Anelia Angelova in a joint statement on the work. “We also introduce a dynamic connection learning method, which incorporates 4D information from a scene by performing connection learning across both feature representations.”
The 4D-Net approach stems from a simple observation: The majority of modern sensor-equipped vehicles include two- and three-dimensional sensors, typically in the form of multiple camera modules and LiDAR, data from which is gathered across a period of time — but very few efforts have been made to gather everything in a single place and process it as a whole.
4D-Net addresses this gap, fusing three-dimensional point-cloud data with visible-light camera imagery and folding in a time element by processing a series of each captured over a set period. The secret to its success: a novel learning technique which can autonomously find and build connections between the data streams, fusing them dynamically at different levels to boost performance over any single feed alone.
“Images in time are […] highly informative, and complementary to both a still image and PCiT [Point Cloud in Time],” the researchers explain of the benefit to the approach. “In fact, for challenging detection cases, motion can be a very powerful clue. While motion can be captured in 3D, a purely PC [Point Cloud]-based method might miss such signals simply because of the sensing sparsity” — the same issue, incidentally, which means distant or small objects can be missed by a LiDAR sensor but picked up on visible-light camera systems or the driver’s naked eye.
Machine learning through time
To handle both types of data, the team turns to a range of pre-processing steps. 3D point cloud data is run through PointPillars, a system for converting the data into a pseudo-image which can be further processed using a convolutional neural network (CNN) designed for two-dimensional data, with a time indicator added to each point to create a denser representation including motion. Conversion into fixed-size representations, effectively sub-sampling the point cloud, is also used — an approach which densifies the point cloud where data is sparse and sparsifies it where data is dense, thus boosting performance at long ranges.
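For the curious, the pillarization idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the grid size, spatial extent, per-pillar point cap, and mean-pooling are all simplifying assumptions. The per-point time value is kept as a fourth channel so that motion information survives the flattening into a pseudo-image.

```python
import numpy as np

def pillarize(points, times, grid=(4, 4), extent=10.0, max_per_pillar=8):
    """Bin (x, y, z) points into a coarse 2D grid of 'pillars', carrying a
    per-point time channel. Grid size, extent, and the per-pillar cap are
    illustrative placeholders, not 4D-Net's actual configuration."""
    h, w = grid
    # Per pillar, hold up to max_per_pillar points of (x, y, z, t).
    pillars = np.zeros((h, w, max_per_pillar, 4), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.int32)
    for (x, y, z), t in zip(points, times):
        # Map metric coordinates onto grid cells; drop out-of-range points.
        i = int((y + extent) / (2 * extent) * h)
        j = int((x + extent) / (2 * extent) * w)
        if 0 <= i < h and 0 <= j < w and counts[i, j] < max_per_pillar:
            pillars[i, j, counts[i, j]] = (x, y, z, t)
            counts[i, j] += 1
    # Mean-pool each pillar into a fixed-size four-channel "pixel".
    denom = np.maximum(counts, 1)[..., None].astype(np.float32)
    pseudo_image = pillars.sum(axis=2) / denom
    return pseudo_image, counts
```

A standard 2D CNN can then treat the resulting (height, width, channels) tensor like an ordinary image, which is the trick that lets image-style networks consume LiDAR data.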
The two-dimensional camera data, meanwhile, is processed into feature maps through Tiny Video Networks, then the data projected to align the 3D points with their corresponding points on the 2D imagery — a process which assumes “calibrated and synchronized sensors.” For point cloud data which lies outside the view of the vehicle’s cameras, a vector of zeroes is applied.
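The projection step can be illustrated with a toy pinhole camera model. The intrinsic matrix, nearest-pixel sampling, and the absence of an extrinsic LiDAR-to-camera transform are all simplifying assumptions here; the zero vector for points outside the image mirrors the behavior described above.

```python
import numpy as np

def sample_image_features(points_3d, feature_map, K):
    """Project 3D points through a pinhole intrinsic matrix K and gather
    the 2D feature vector under each projected pixel. Points that land
    outside the image, or sit behind the camera, get a zero vector, as
    4D-Net does for points the cameras cannot see. The pinhole model and
    nearest-pixel lookup are simplifying assumptions."""
    h, w, c = feature_map.shape
    out = np.zeros((len(points_3d), c), dtype=feature_map.dtype)
    for n, (x, y, z) in enumerate(points_3d):
        if z <= 0:  # behind the camera plane: leave the zero vector
            continue
        u, v, s = K @ np.array([x, y, z])
        u, v = int(u / s), int(v / s)  # perspective divide, nearest pixel
        if 0 <= u < w and 0 <= v < h:
            out[n] = feature_map[v, u]
    return out
```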
The truly smart part of the 4D-Net system, though, comes in the form of its connection architecture search — the means by which it is able to extract the most, and most suitable, information from the fused data. A one-shot lightweight differentiable architecture search finds related information in both 3D and in time and connects it across the two different sensing modalities — and learns the combination of feature representations at various abstraction levels for both sensors.
“[This] is very powerful,” the team explains, “as it allows to learn the relations between different levels of feature abstraction and different sources of features.” To further tweak the approach for autonomous vehicles, the team modified the connections to be dynamic, based on the concept of self-attention mechanisms, allowing the network to dynamically select particular visible-light data blocks for information extraction — meaning it can learn how and where to select features based on variable input.
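In spirit, such a dynamic connection works like a small self-attention gate: the point-cloud feature produces a score for each candidate visible-light feature block, and a softmax turns those scores into input-dependent mixing weights. The sketch below is a hand-rolled toy, with the weight matrix supplied for illustration rather than learned end-to-end as in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def dynamic_connection(pc_feature, rgb_blocks, W):
    """Toy self-attention-style gate: project the point-cloud feature (via
    an assumed weight matrix W) into one score per RGB feature block, then
    mix the blocks by the softmax of those scores. In 4D-Net the weights
    are learned; here they are fixed inputs for illustration."""
    scores = W @ pc_feature         # one scalar score per block
    attn = softmax(scores)          # input-dependent mixing weights
    stacked = np.stack(rgb_blocks)  # (num_blocks, feature_dim)
    fused = attn @ stacked          # weighted sum of the blocks
    return fused, attn
```

Because the weights depend on the point-cloud feature itself, a different scene produces a different mixture, which is the sense in which the connections are "dynamic" rather than fixed at training time.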
Testing both a single-stream and a multi-stream variant of the system, the latter pulling in additional input streams in the form of still-image and video feeds running at different resolutions, the team claims some impressive gains over rival state-of-the-art approaches.
Tested against the Waymo Open Dataset, 4D-Net improved on average precision (AP) across all tested rival approaches. While its performance proved, on average, weaker at shorter distances, its ability to recognize more distant objects, those 50 meters away and beyond, is reportedly unsurpassed, especially when running in multi-stream mode.
“We demonstrate improved state-of-the-art performance and competitive inference run-times,” the team concludes, “despite using 4D sensing and both modalities in time. Without loss of generality, the same approach can be extended to other streams of RGB images, e.g., the side cameras providing critical information for highly occluded objects, or to diverse learnable feature representations for PC [Point Cloud] or images, or to other sensors.”
The researchers suggest that the 4D-Net approach can also be used outside the field of autonomous driving, wherever there’s a need to capture different aspects of the same domain by aligning audio, video, text, and image data automatically.
The team’s work was presented at the International Conference on Computer Vision (ICCV) 2021, and has been made available under open-access terms. A supporting write-up by AJ Piergiovanni and Anelia Angelova is available on the Google AI Blog. The researchers have pledged to make their code available under an open-source license, but it had not been published at the time of writing.