One of the key qualities of a good driver is the ability to anticipate what others on the road might do. For example, how likely is another car to merge into our lane, or the cyclist in front of us to make a left turn? Accurately predicting the intentions of other road users allows the Waymo Driver to make the safest possible decisions.
Predicting the behavior of others on the road is difficult: it often requires a holistic understanding of a scene and its context, including the width of road lanes, four-way intersection rules, traffic lights, and signs. In combination with live information from our powerful sensors, Waymo's custom-made, highly detailed maps provide this important semantic context to the Waymo Driver. However, sensor data and maps alone are not enough to make predictions. The behavior of other road users is often complicated and hard to capture with a set of map-derived traffic rules, because driving patterns vary across locations and other road users may break those rules. Machine learning helps us model this complexity by enabling the system to learn new types of behavior.
The most popular way to incorporate highly detailed maps into behavior prediction models is to render the map into pixels and encode the scene information, such as traffic signs, lanes, and road boundaries, with a convolutional neural network (CNN). However, this process requires significant compute and time. Additionally, processing maps as imagery makes it challenging to model long-range road geometry, such as lanes merging ahead, which degrades the quality of the predictions.
To address these pain points and improve both our behavior predictions and the decisions they inform, we developed a new model, VectorNet, that provides more accurate behavior predictions while using less compute than CNN-based approaches.
How VectorNet enables the Waymo Driver
Simplifying our highly detailed maps and sensor inputs to our “Abstracted World” to create better behavior predictions with less compute
Both map features and our sensor inputs can be simplified into a point, a polygon, or a curve. For example, a lane boundary contains multiple control points that build a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point. Points, polygons, and curves can all be approximated as polylines containing multiple control points, and polylines can in turn be split into vector fragments. In this way, we can represent all road features and the trajectories of other objects as a set of such vectors. With this simplified view, we set out to design a network that could effectively process our sensor and map inputs.
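The vectorization step described above can be sketched in a few lines. This is a hypothetical illustration rather than Waymo's actual code: a polyline's control points are split into consecutive vector fragments, each tagged with the id of the polyline it came from.

```python
def polyline_to_vectors(points, polyline_id):
    """Split a polyline (a list of (x, y) control points) into
    vector fragments: (start, end, polyline_id) tuples."""
    return [(points[i], points[i + 1], polyline_id)
            for i in range(len(points) - 1)]

# A crosswalk polygon, closed by repeating the first point,
# becomes four vector fragments; a lane spline or an agent's
# trajectory would be sampled into control points the same way.
crosswalk = [(0, 0), (4, 0), (4, 2), (0, 2), (0, 0)]
vectors = polyline_to_vectors(crosswalk, polyline_id=7)
```

In practice each vector would also carry attribute features, such as the object type or a timestamp, alongside its endpoints; the polyline id lets the network group fragments back into their source feature.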
We proposed a novel hierarchical graph neural network. In the first level, composed of polyline subgraphs, VectorNet aggregates information within each polyline; in the second level, the global interaction graph, VectorNet exchanges information among polylines.
Through this process, the network captures the relationships between vectors that arise when, for example, a car enters an intersection or a pedestrian approaches a crosswalk. By learning these interactions between road features and object trajectories, VectorNet's data-driven, machine learning-based approach allows us to better predict other agents' behavior across a wide range of behavior patterns.
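The two-level hierarchy can be illustrated with a minimal, hypothetical sketch. Real VectorNet uses learned MLPs and self-attention at both levels; here we substitute plain element-wise max pooling purely to show the data flow, first within each polyline, then across polylines.

```python
def pool(features):
    """Element-wise max over a list of equal-length feature vectors.
    A stand-in for the learned aggregation in the actual model."""
    return [max(col) for col in zip(*features)]

def hierarchical_forward(polylines):
    # Level 1 (polyline subgraphs): aggregate the vector fragments
    # of each polyline into one feature per polyline.
    polyline_feats = [pool(vectors) for vectors in polylines]
    # Level 2 (global interaction graph): combine polyline features
    # into a scene-level representation.
    return pool(polyline_feats)

scene = [
    [[1.0, 0.0], [0.5, 2.0]],   # e.g. fragments of a lane boundary
    [[0.0, 1.0], [3.0, 0.5]],   # e.g. fragments of an agent trajectory
]
scene_feature = hierarchical_forward(scene)  # [3.0, 2.0]
```

The key design choice this mirrors is that information is first kept local to each map feature or trajectory, and only the summarized polyline features interact globally, which is what lets the model relate, say, a trajectory polyline to a distant merging-lane polyline without rasterizing anything.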
To further boost VectorNet’s capabilities and understanding of the real world, we trained the system to learn from context clues and make inferences about what could happen next around the vehicle. For example, important scene information can often be occluded while driving, such as foliage blocking a stop sign. When this happens to a human driver, they can draw on past experience to infer that something is likely there even though they cannot see it. By randomly masking out map features during training, such as a stop sign at a four-way intersection, and requiring the network to complete them, VectorNet further improves the Waymo Driver’s understanding of the world around it and its readiness for the unexpected.
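This masked-feature training objective, analogous to masked-language-model pretraining, can be sketched as follows. The function name and shapes are illustrative, not Waymo's API: some polyline features are zeroed out at random, and the originals are kept as the reconstruction targets the network must predict.

```python
import random

def mask_polylines(polyline_feats, mask_prob=0.5, rng=None):
    """Randomly zero out polyline features; return the masked
    inputs and a dict of {index: original feature} targets."""
    rng = rng or random.Random(0)  # seeded here for reproducibility
    masked, targets = [], {}
    for i, feat in enumerate(polyline_feats):
        if rng.random() < mask_prob:
            targets[i] = feat                 # ground truth to reconstruct
            masked.append([0.0] * len(feat))  # zeroed-out placeholder
        else:
            masked.append(feat)               # left visible to the model
    return masked, targets

feats = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0], [2.0, 2.0]]
masked, targets = mask_polylines(feats)
```

During training, the reconstruction loss on `targets` would be added to the trajectory-prediction loss, pushing the network to fill in occluded map elements from the surrounding context.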
Validating the performance of VectorNet
When comparing VectorNet to a ResNet baseline, VectorNet reduces both compute and displacement error on Waymo's dataset and the Argoverse dataset.
We validated the performance of VectorNet on the task of trajectory prediction. Compared with ResNet-18, one of the most advanced and widely used ConvNets, VectorNet achieves up to 18% better performance while using only 29% of the parameters and consuming just 20% of the computation when there are 50 agents per scene.
These improvements enable us to make better predictions, creating a safer and smoother experience for our riders, and even for the parcels we carry on behalf of our local delivery partners. This will be especially beneficial as we expand to more cities, where we will continue encountering new scenarios and behavior patterns. VectorNet will allow us to adapt to these new areas more efficiently and effectively, helping us achieve our goal of delivering fully self-driving technology to more people in more places.
This collaboration between Waymo and Google was initiated and sponsored by Congcong Li and Drago Anguelov of Waymo. The work was conducted by Jiyang Gao, Yi Shen, and Hang Zhao of Waymo, and Chen Sun and Cordelia Schmid from Google. The team will share more details about VectorNet virtually at CVPR 2020 in June. Waymo is also holding its Workshop on Scalability in Autonomous Driving at CVPR, where additional research will be presented.
Join our team and help us build the World’s Most Experienced Driver™. Waymo is looking for talented software and hardware engineers, researchers, and out-of-the-box thinkers to help us tackle real-world problems, and make the roads safer for everyone. Come work with other passionate engineers and world-class researchers on novel and difficult problems—learn more at waymo.com/joinus.