MotionLM: Multi-Agent Motion Forecasting as Language Modeling

Authors

Ari Seff
Brian Cera
Dian Chen
Mason Ng
Aurick Zhou
Nigamaa Nayakanti
Khaled S. Refaat
Rami Al-Rfou
Benjamin Sapp

Abstract

Reliable forecasting of the future behavior of road agents is a critical component of safe planning in autonomous vehicles. Here, we represent continuous trajectories as sequences of discrete motion tokens and cast multi-agent motion prediction as a language modeling task over this domain. Our model, MotionLM, provides several advantages: First, it does not require anchors or explicit latent variable optimization to learn multimodal distributions. Instead, we leverage a single standard language modeling objective, maximizing the average log probability over sequence tokens. Second, our approach bypasses post-hoc interaction heuristics where individual agent trajectory generation is conducted prior to interactive scoring. Instead, MotionLM produces joint distributions over interactive agent futures in a single autoregressive decoding process. In addition, the model's sequential factorization enables temporally causal conditional rollouts. The proposed approach establishes new state-of-the-art performance for multi-agent motion prediction on the Waymo Open Motion Dataset, ranking 1st on the interactive challenge leaderboard.

Overview

MotionLM autoregressively generates sequences of discrete tokens for a set of agents to produce interactive trajectory forecasts. At each timestep, a token is sampled for each agent from a finite vocabulary and appended to the global sequence.
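The decoding loop described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_logits` is a hand-written stand-in for the learned decoder, `VOCAB_SIZE` is an arbitrary vocabulary size, and the array layout is one possible choice. The sketch only shows the control flow: at each timestep one token is sampled per agent, conditioned on the tokens of all agents from earlier timesteps, and the step is then appended to the shared sequence.

```python
import numpy as np

VOCAB_SIZE = 16  # hypothetical vocabulary size, chosen only for illustration


def toy_logits(history, agent, vocab_size=VOCAB_SIZE):
    """Hypothetical stand-in for a learned decoder.

    `history` has shape (num_agents, t) and holds every agent's tokens so far,
    so the prediction for one agent may depend on all agents' pasts.
    """
    logits = np.zeros(vocab_size)
    if history.shape[1] > 0:
        # Toy rule: favor the agent's own last token and, more weakly,
        # the rounded mean of all agents' last tokens.
        logits[history[agent, -1]] += 2.0
        logits[int(round(float(history[:, -1].mean())))] += 1.0
    return logits


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def rollout(num_agents, num_steps, rng, logits_fn=toy_logits):
    """Sample one joint rollout of token ids with shape (num_agents, num_steps)."""
    history = np.zeros((num_agents, 0), dtype=np.int64)
    for _ in range(num_steps):
        step = np.empty(num_agents, dtype=np.int64)
        for agent in range(num_agents):
            probs = softmax(logits_fn(history, agent))
            step[agent] = rng.choice(len(probs), p=probs)
        # Append this timestep's tokens for all agents to the global sequence.
        history = np.concatenate([history, step[:, None]], axis=1)
    return history
```

Calling `rollout(2, 5, np.random.default_rng(0))` returns a 2 x 5 integer array; drawing several rollouts with different seeds yields different futures, which is how distinct modes arise from per-step sampling alone.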

Without geometric anchors or latent variable optimization, multimodal distributions emerge solely via per-step sampling. Meanwhile, the training objective is kept simple with minimal assumptions: just next-token prediction.
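As a rough illustration of a next-token prediction objective, the sketch below computes the average negative log probability of ground-truth tokens under predicted logits; minimizing it is equivalent to maximizing the average log probability over sequence tokens. The function name `next_token_nll` and the tensor layout (agents, timesteps, vocabulary) are assumptions for this example and are not taken from the MotionLM implementation.

```python
import numpy as np


def next_token_nll(logits, targets):
    """Average negative log-likelihood of the target tokens.

    logits:  (num_agents, num_steps, vocab_size) scores predicted for each
             position before its target token is revealed
    targets: (num_agents, num_steps) integer token ids
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log probability assigned to each ground-truth token.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(-picked.mean())
```

With uniform logits over a vocabulary of size V, the loss equals log V, which is a convenient sanity check for an untrained model.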

The resulting model can perform marginal, joint, and conditional forecasting. MotionLM establishes new state-of-the-art performance on both the Waymo Open Motion Dataset motion prediction and interaction prediction benchmarks.

Examples

Marginal vs. Joint

Attention-based interactive modeling during decoding allows for scene-level consistency. While marginal (independent per agent) predictions may lead to unrealistic overlap (left), joint predictions exhibit appropriate reactions across agents (right).

[Figure panels: Marginal (left), Joint (right)]

Marginal vs. Conditional

When conditioning on a query agent trajectory (magenta), the predicted agent trajectory (cyan) can appropriately respond.

Marginal	Conditional
The marginal prediction for the pedestrian (cyan) crosses the street as the vehicle turns, leading to a collision.	When conditioning on the turning vehicle’s trajectory (magenta), the pedestrian is predicted to yield.
The marginal prediction for the modeled vehicle (cyan) collides with the lead vehicle.	When conditioning on the lead vehicle’s trajectory (magenta), the modeled vehicle (cyan) comes to an appropriate stop.
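
The conditional examples above can be mimicked with a small sketch in which the query agent's tokens are held fixed while the predicted agent is sampled step by step. This is an illustration under stated assumptions, not the authors' code: `conditional_rollout` and `follow_query_logits` are hypothetical names, the toy decoder simply copies the query agent's previous token, and the causal rule used here (the predicted agent at step t sees only tokens from steps before t) is one reading of "temporally causal".

```python
import numpy as np


def follow_query_logits(history, agent, vocab_size=8):
    """Hypothetical toy decoder: strongly prefers the query agent's previous token."""
    logits = np.zeros(vocab_size)
    if history.shape[1] > 0:
        logits[history[0, -1]] = 50.0
    return logits


def conditional_rollout(query_tokens, num_targets, logits_fn, rng):
    """Sample target-agent tokens given a fixed query-agent token sequence.

    Row 0 of the result is the query agent (copied, never sampled);
    rows 1 and up are the sampled target agents.
    """
    query_tokens = np.asarray(query_tokens, dtype=np.int64)
    num_agents = 1 + num_targets
    history = np.zeros((num_agents, 0), dtype=np.int64)
    for t in range(len(query_tokens)):
        step = np.empty(num_agents, dtype=np.int64)
        step[0] = query_tokens[t]  # the query agent follows its given trajectory
        for agent in range(1, num_agents):
            # `history` holds only steps before t, so the target agent cannot
            # peek at the query agent's current or future tokens.
            logits = logits_fn(history, agent)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            step[agent] = rng.choice(len(probs), p=probs)
        history = np.concatenate([history, step[:, None]], axis=1)
    return history
```

For example, with `query_tokens = [3, 3, 5, 5]` the toy target agent reproduces the query agent's tokens one step late, reflecting that it reacts only to what the query agent has already done.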