Wayformer: Motion Forecasting via Simple & Efficient Attention Networks

Authors

  • Nigamaa Nayakanti

  • Rami Al-Rfou

  • Aurick Zhou

  • Kratarth Goel

  • Khaled S. Refaat

  • Benjamin Sapp

    Abstract

Motion forecasting for autonomous driving is a challenging task because complex driving scenarios result in a heterogeneous mix of static and dynamic inputs. It is an open problem how best to represent and fuse information about road geometry, lane connectivity, time-varying traffic light state, and the history of a dynamic set of agents and their interactions into an effective encoding. To model this diverse set of input features, many approaches propose equally complex systems with a diverse set of modality-specific modules. This results in systems that are difficult to scale, extend, or tune in rigorous ways to trade off quality and efficiency. In this paper, we present Wayformer, a family of attention-based architectures for motion forecasting that are simple and homogeneous. Wayformer offers a compact model description consisting of an attention-based scene encoder and a decoder. In the scene encoder we study the choice of early, late, and hierarchical fusion of input modalities. For each fusion type we explore strategies to trade off efficiency and quality via factorized attention or latent query attention. We show that early fusion, despite its simplicity of construction, is not only modality agnostic but also achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and Argoverse leaderboards, demonstrating the effectiveness of our design philosophy.

    Overview

Fig 1: The Wayformer architecture as a pair of encoder/decoder Transformer networks. This model takes multimodal scene data as input and produces a multimodal distribution of trajectories.

    In this work:

    • We design a family of models with two basic primitives: a self-attention encoder, where we fuse one or more modalities across temporal and spatial dimensions, and a cross-attention decoder, where we attend to driving scene elements to produce a diverse set of trajectories.

    • We study three variations of the scene encoder that differ in how and when different input modalities are fused.

    • To keep our proposed models within practical real time constraints of motion forecasting, we study two common techniques to speed up self-attention: factorized attention and latent query attention.

• We achieve state-of-the-art results on both the WOMD and Argoverse challenges.
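To make the latent query attention idea concrete, here is a minimal, framework-free sketch (plain numpy, identity projections, names of our choosing; the paper's actual implementation details are not specified here). A small set of learned latent queries cross-attends to a large set of scene tokens, so the cost scales as O(L*N) rather than the O(N^2) of full self-attention when L << N:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_query_attention(scene_tokens, latent_queries):
    """Cross-attend L learned latent queries to N scene tokens.

    Projections are left as identities for brevity; a real block would
    also include learned Q/K/V projections, an MLP, and residuals.
    """
    d = scene_tokens.shape[-1]
    scores = latent_queries @ scene_tokens.T / np.sqrt(d)  # (L, N)
    return softmax(scores, axis=-1) @ scene_tokens         # (L, d)

rng = np.random.default_rng(0)
scene = rng.normal(size=(512, 64))   # N = 512 scene tokens, d = 64
latents = rng.normal(size=(16, 64))  # L = 16 learned latent queries
out = latent_query_attention(scene, latents)
print(out.shape)  # (16, 64)
```

The decoder follows the same pattern in spirit: a fixed set of queries cross-attends to the encoded scene to produce a diverse set of trajectory modes.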

Fig. 2: Wayformer scene encoder fusing multimodal inputs at different stages. Late fusion dedicates an attention encoder to each modality, while early fusion processes all inputs within one cross-modal encoder. Finally, hierarchical fusion combines both approaches.
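The three fusion variants can be sketched in a few lines of numpy (a toy illustration with made-up token counts and a single identity-projection attention pass standing in for a full Transformer encoder; none of these names come from the paper's code):

```python
import numpy as np

def self_attn(tokens):
    """One toy self-attention pass with identity projections."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ tokens

rng = np.random.default_rng(0)
# Hypothetical per-modality token sets, already projected to d = 32.
roadgraph = rng.normal(size=(40, 32))
agents = rng.normal(size=(24, 32))
lights = rng.normal(size=(8, 32))

# Early fusion: concatenate all modalities, then one shared encoder.
early = self_attn(np.concatenate([roadgraph, agents, lights], axis=0))

# Late fusion: a separate encoder per modality, combined afterwards.
late = np.concatenate(
    [self_attn(m) for m in (roadgraph, agents, lights)], axis=0)

# Hierarchical fusion: per-modality encoders feeding a cross-modal one.
hier = self_attn(late)

print(early.shape, late.shape, hier.shape)
```

Early fusion is the simplest to construct, since adding a new modality only means appending its tokens before the shared encoder; this modality-agnostic property is part of why it performs well in the paper's study.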

    Examples:


This is an interesting video from a recent SF ride in which the Waymo Driver safely stopped for a cyclist coming the wrong way from behind an occlusion: a nice example of what we encounter daily, and of how our 5th-gen hardware and software work together to navigate complex, dense urban situations.