MotionLM: Multi-Agent Motion Forecasting as Language Modeling

Authors

  • Ari Seff

  • Brian Cera

  • Dian Chen

  • Mason Ng

  • Aurick Zhou

  • Nigamaa Nayakanti

  • Khaled S. Refaat

  • Rami Al-Rfou

  • Benjamin Sapp

    Abstract

    Reliable forecasting of the future behavior of road agents is a critical component to safe planning in autonomous vehicles. Here, we represent continuous trajectories as sequences of discrete motion tokens and cast multi-agent motion prediction as a language modeling task over this domain. Our model, MotionLM, provides several advantages: First, it does not require anchors or explicit latent variable optimization to learn multimodal distributions. Instead, we leverage a single standard language modeling objective, maximizing the average log probability over sequence tokens. Second, our approach bypasses post-hoc interaction heuristics where individual agent trajectory generation is conducted prior to interactive scoring. Instead, MotionLM produces joint distributions over interactive agent futures in a single autoregressive decoding process. In addition, the model's sequential factorization enables temporally causal conditional rollouts. The proposed approach establishes new state-of-the-art performance for multi-agent motion prediction on the Waymo Open Motion Dataset, ranking 1st on the interactive challenge leaderboard.

    Overview

    • Overall framework of MotionLM displaying continuous trajectories represented as sequences of discrete motion tokens.
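
The page does not spell out how trajectories become tokens, so the following is only a minimal sketch under stated assumptions: per-step displacements are uniformly binned along each axis, and the x and y bins are combined into one token id. `NUM_BINS`, `MAX_DELTA`, `tokenize`, and `detokenize` are hypothetical names and values chosen for illustration.

```python
import numpy as np

NUM_BINS = 13    # assumed number of bins per axis (illustrative)
MAX_DELTA = 3.0  # assumed largest per-step displacement, in metres (illustrative)

def tokenize(traj):
    """Map a (T+1, 2) array of xy waypoints to T discrete token ids."""
    deltas = np.diff(traj, axis=0)                    # per-step displacement, shape (T, 2)
    clipped = np.clip(deltas, -MAX_DELTA, MAX_DELTA)  # keep values inside the binned range
    # Scale each axis to an integer bin index in [0, NUM_BINS - 1].
    bins = np.rint((clipped + MAX_DELTA) / (2 * MAX_DELTA) * (NUM_BINS - 1)).astype(int)
    # Combine the x bin and y bin into a single token id.
    return bins[:, 0] * NUM_BINS + bins[:, 1]

def detokenize(tokens, start):
    """Rebuild an approximate (T+1, 2) trajectory from token ids and a start point."""
    bins = np.stack([tokens // NUM_BINS, tokens % NUM_BINS], axis=-1)
    deltas = bins / (NUM_BINS - 1) * (2 * MAX_DELTA) - MAX_DELTA
    return np.concatenate([start[None, :], start + np.cumsum(deltas, axis=0)], axis=0)
```

With these assumed values the vocabulary has 13 × 13 = 169 tokens and each bin is 0.5 m wide, so any in-range displacement is recovered to within 0.25 m per axis per step.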

    MotionLM autoregressively generates sequences of discrete tokens for a set of agents to produce interactive trajectory forecasts. At each timestep, a token is sampled for each agent from a finite vocabulary and appended to the global sequence.
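
The sampling loop described above can be sketched roughly as follows. This is not the MotionLM implementation: `toy_logits` is a hypothetical stand-in for the transformer decoder, `VOCAB_SIZE` is an illustrative value, and the sketch assumes that agents sampled at the same timestep see only tokens from earlier timesteps.

```python
import numpy as np

VOCAB_SIZE = 169  # assumed vocabulary size (illustrative)

def toy_logits(history, agent, rng):
    """Hypothetical decoder stand-in: random next-token logits for one agent.

    A real model would attend to `history` (all agents' earlier tokens)
    and to the scene context; this placeholder ignores both.
    """
    return rng.normal(size=VOCAB_SIZE)

def joint_rollout(num_agents, num_steps, logits_fn=toy_logits, seed=0):
    """Sample one joint future as a (num_steps, num_agents) array of token ids."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((num_steps, num_agents), dtype=int)
    for t in range(num_steps):
        history = tokens[:t]  # every agent's tokens from earlier timesteps
        for a in range(num_agents):
            logits = logits_fn(history, a, rng)
            probs = np.exp(logits - logits.max())  # numerically stable softmax
            probs /= probs.sum()
            tokens[t, a] = rng.choice(VOCAB_SIZE, p=probs)  # sample, do not argmax
    return tokens
```

Calling `joint_rollout` repeatedly with different seeds yields different joint futures, which is how distinct modes would arise from per-step sampling alone.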

    • MotionLM architecture diagram displaying the autoregressive transformer decoder sampling sequences of motion tokens.

    Because the model bypasses geometric anchors and latent variable optimization, multimodal distributions emerge solely from per-step sampling. The training objective also stays simple and makes minimal assumptions: it is just next-token prediction.
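
As a rough illustration of a next-token objective, the sketch below computes the average log probability that a set of logits assigns to the ground-truth tokens; training would maximize this quantity (equivalently, minimize its negative, the usual cross-entropy). The function name and array shapes are assumptions made for this example.

```python
import numpy as np

def avg_log_prob(logits, targets):
    """Average log probability of the target tokens.

    logits:  (N, V) array of next-token logits, one row per sequence position.
    targets: (N,) array of ground-truth token ids.
    """
    # Log-softmax, computed stably by subtracting the row-wise maximum first.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log probability of each target token and average.
    return log_probs[np.arange(len(targets)), targets].mean()
```

For uniform logits over a vocabulary of size V, the value is -log V; it approaches 0 as the model concentrates probability on the correct tokens.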

    The resulting model can perform marginal, joint, and conditional forecasting. MotionLM establishes new state-of-the-art performance on both the Waymo Open Motion Dataset motion prediction and interaction prediction benchmarks.

    Examples

    Marginal vs. Joint

    Attention-based interactive modeling during decoding allows for scene-level consistency. While marginal (independent per agent) predictions may lead to unrealistic overlap (left), joint predictions exhibit appropriate reactions across agents (right).

    • Abstract depiction of marginal predictions for road agents leading to unrealistic overlap/collisions.
    • Abstract depiction of joint predictions for road agents leading to realistic interactions.

    Marginal vs. Conditional

    When conditioning on a query agent trajectory (magenta), the predicted agent trajectory (cyan) can appropriately respond.

    Abstract depiction of a marginal prediction for a single road agent.

    In the marginal prediction, the pedestrian (cyan) crosses the street as the vehicle turns, leading to a collision.

    When conditioning on the turning vehicle’s trajectory (magenta), the pedestrian is predicted to yield.

    Abstract depiction of a marginal prediction for a single road agent.

    In the marginal prediction, the modeled vehicle (cyan) collides with the lead vehicle.

    Abstract depiction of a marginal prediction for a single road agent, conditioned on a query trajectory for a nearby agent.

    When conditioning on the lead vehicle’s trajectory (magenta), the modeled vehicle (cyan) comes to an appropriate stop.
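
One way to picture conditioning on a query trajectory is to hold the query agent's tokens fixed and sample only the remaining agents. The sketch below does this under stated assumptions: `toy_logits` is a hypothetical stand-in for the decoder, `VOCAB_SIZE` is illustrative, and at step t the sampled agents see only tokens from steps before t, so the rollout stays temporally causal with respect to the query agent.

```python
import numpy as np

VOCAB_SIZE = 169  # assumed vocabulary size (illustrative)

def toy_logits(history, agent, rng):
    """Hypothetical decoder stand-in: random next-token logits for one agent."""
    return rng.normal(size=VOCAB_SIZE)

def conditional_rollout(query_tokens, query_agent, num_agents,
                        logits_fn=toy_logits, seed=0):
    """Sample the other agents' tokens given a fixed query-agent token sequence."""
    rng = np.random.default_rng(seed)
    num_steps = len(query_tokens)
    tokens = np.zeros((num_steps, num_agents), dtype=int)
    tokens[:, query_agent] = query_tokens  # the query agent is given, never sampled
    for t in range(num_steps):
        # Only earlier steps are visible, so the query agent's current and
        # future tokens cannot influence what is sampled at step t.
        history = tokens[:t]
        for a in range(num_agents):
            if a == query_agent:
                continue
            logits = logits_fn(history, a, rng)
            probs = np.exp(logits - logits.max())  # numerically stable softmax
            probs /= probs.sum()
            tokens[t, a] = rng.choice(VOCAB_SIZE, p=probs)
    return tokens
```

With a decoder that actually attends to `history`, the sampled agent could react to the query agent's past motion, for example by yielding or stopping, as in the scenes above.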