Waymo Open Dataset
Challenge

Interaction Prediction

Overview

Given agents' tracks for the past 1 second on a corresponding map, predict the joint future positions of 2 interacting agents for 8 seconds into the future. The ground truth future data for the interactive test set is hidden from challenge participants. As such, the test sets contain only 1 second of history data. The validation sets contain the ground truth future data for use in model development. In addition, the test and validation sets provide 2 interacting object tracks in the scene to be predicted. These are selected to include interesting behavior and a balance of object types.

We are not running a new Interaction Prediction challenge in 2024. However, we have changed the metrics definition and are accepting new submissions.

Submit

Submissions for this version of the challenge are closed. You can submit to the 2024 version of the Interaction Prediction challenge.

Metrics

Leaderboard ranking for this challenge is determined by mAP, averaged across the evaluation times (3, 5, and 8 seconds) and across all object types. Miss rate is used as a secondary metric.

All metrics described below are computed by first bucketing all object pairs by object type. For joint predictions, the least common type among the objects is used (frequency order: vehicle > pedestrian > cyclist). The metrics are then computed per type. The metrics (ADE, FDE, miss rate, overlap rate, and mAP) are all computed at the 3, 5, and 8 second timestamps.
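As a concrete sketch, the least-common-type selection for a pair can be written as follows (the helper and the type strings are illustrative, not part of the official tooling):

```python
# Frequency order: vehicle > pedestrian > cyclist, so the "least common"
# type present in a pair is the rarest one. Names here are illustrative.
RARITY = {"vehicle": 0, "pedestrian": 1, "cyclist": 2}

def pair_bucket(types):
    """Return the least common object type among the pair's types."""
    return max(types, key=lambda t: RARITY[t])
```

For example, a vehicle/cyclist pair is bucketed as cyclist, and a vehicle/pedestrian pair as pedestrian.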

Definitions

  • Let A be a set of N agents.

  • Let G be a subset of A containing M agents (the agents to be jointly predicted).

  • Let K be the number of predicted future trajectories.

  • Let T be the number of time steps per trajectory.

  • G is associated with a joint future trajectory distribution

\begin{equation} \big\{ \big(l^{i}, S_{G_j}^{i} \big|_{j=1}^{M} \big) \big\}_{i=1}^{K} \end{equation}

Where \(l^{i}\) is an un-normalized likelihood for joint prediction i.
Where \(S_{G_j}^{i}\) is the predicted trajectory for the jth agent of joint prediction i. We will call this set of K joint predictions for M agents a multi-modal joint prediction.
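In array form, a multi-modal joint prediction might be laid out as follows. The shapes and variable names are ours, chosen for illustration; the official submission format differs:

```python
import numpy as np

M, K, T = 2, 6, 80   # agents in G, joint modes, prediction steps (illustrative)

# One un-normalized likelihood per joint prediction i.
likelihoods = np.random.rand(K)        # shape (K,)

# S[i, j, t] holds the predicted (x, y) of agent j at step t in joint mode i.
S = np.random.randn(K, M, T, 2)        # shape (K, M, T, 2)

# Ground truth trajectories for the M agents in G.
gt = np.random.randn(M, T, 2)          # shape (M, T, 2)
```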

minADE

Minimum Average Displacement Error

Let \(\hat{s}_{G_j}^{t}\) be the ground truth position of the jth agent at time step t.
The minADE metric computes the mean of the l2 norm between the ground truth for all agents in G and the closest joint prediction:

\begin{equation} \mbox{minADE}(G) = \frac{1}{M} \min_i \sum_{j=1}^{M} \frac{1}{T} \sum_{t=1}^{T}||\hat{s}_{G_j}^{t} - s_{G_j}^{it}||_2 \end{equation}

Where T is the last prediction time step to include in the metric.
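A direct NumPy sketch of this formula (not the official implementation), using a ground truth array of shape (M, T, 2) and a prediction array of shape (K, M, T, 2):

```python
import numpy as np

def min_ade(gt, S, T):
    """minADE over the first T steps.

    gt: (M, T_total, 2) ground truth; S: (K, M, T_total, 2) joint predictions.
    """
    # l2 error per mode, agent, and step: shape (K, M, T)
    err = np.linalg.norm(S[:, :, :T] - gt[None, :, :T], axis=-1)
    # Mean over time (1/T sum_t), mean over agents (1/M sum_j), min over modes.
    return err.mean(axis=2).mean(axis=1).min()
```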

minFDE

Minimum Final Displacement Error

The minFDE metric is equivalent to evaluating the minADE metric at a single time step T:

\begin{equation} \mbox{minFDE}(G) = \frac{1}{M} \min_i \sum_{j=1}^{M} ||\hat{s}_{G_j}^{T} - s_{G_j}^{iT}||_2 \end{equation}
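The same sketch reduced to the final step only (again, illustrative rather than the official implementation):

```python
import numpy as np

def min_fde(gt, S, T):
    """minFDE at time step T (1-indexed, so array index T - 1)."""
    # Displacement at the final step only: shape (K, M)
    err = np.linalg.norm(S[:, :, T - 1] - gt[None, :, T - 1], axis=-1)
    return err.mean(axis=1).min()
```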

Miss Rate

A miss is defined as a multi-modal joint prediction where none of the individual K joint predictions contain trajectories for all M objects in the group that are within a given lateral and longitudinal threshold of the ground truth trajectory at a given time T.

That is, for each agent in prediction i, the displacement vector at time T is rotated into the agent's coordinate frame:

\begin{equation} D_j^i = (\hat{s}_{G_{j}}^{T} - s_{G_{j}}^{iT}) \cdot R_j^\top \end{equation}

where \(R_j\) is a rotation matrix that aligns a unit x vector with the jth agent's ground-truth heading at time T, and \(D_j^i = (d_{jx}^i, d_{jy}^i)\) gives the longitudinal and lateral components of the displacement.

If, for all agents j in any single joint prediction i, \(|d_{jy}^i| < \text{Threshold}_{lat}\) and \(|d_{jx}^i| < \text{Threshold}_{lon}\), then the multi-modal joint prediction is considered a correct prediction rather than a miss; otherwise a single miss is counted for the multi-modal joint prediction. The miss rate is calculated as the total number of misses divided by the total number of multi-modal joint predictions.
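The miss test can be sketched as follows. Ground-truth headings and already-scaled thresholds are assumed given, and the helper is ours, not the official code:

```python
import numpy as np

def is_miss(gt_final, pred_final, headings, thr_lat, thr_lon):
    """True if no joint mode places every agent within the thresholds at T.

    gt_final: (M, 2) ground truth at time T; pred_final: (K, M, 2) joint
    predictions at T; headings: (M,) ground-truth headings in radians.
    """
    for k in range(pred_final.shape[0]):
        hit = True
        for j, h in enumerate(headings):
            dx, dy = gt_final[j] - pred_final[k, j]
            # Rotate the displacement into the agent frame (x = longitudinal).
            lon = np.cos(h) * dx + np.sin(h) * dy
            lat = -np.sin(h) * dx + np.cos(h) * dy
            if not (abs(lat) < thr_lat and abs(lon) < thr_lon):
                hit = False
                break
        if hit:
            return False  # at least one joint mode matches all agents
    return True
```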

The thresholds change with both velocity and measurement step T as follows:

    T              Threshold_lat (m)    Threshold_lon (m)
    3 seconds      1                    2
    5 seconds      1.8                  3.6
    8 seconds      3                    6

The thresholds are also scaled according to the initial speed of the agent. The scaling function is a piecewise linear function of the initial speed \(v_i\):

\begin{equation}
\mbox{Scale}(v_i) =
\begin{cases}
0.5 & \text{if $v_i <$ 1.4 m/s}\\
0.5+0.5\alpha & \text{if 1.4 m/s $\leq v_i \leq$ 11 m/s}\\
1 & \text{if $v_i >$ 11 m/s}
\end{cases}
\end{equation}

where \(\alpha = (v_i - 1.4) / (11 - 1.4)\)

The thresholds are calculated as:

\(\text{Threshold}_{lat}(v_i, T) = \mbox{Scale}(v_i) \cdot \text{Threshold}_{lat}(T)\)
\(\text{Threshold}_{lon}(v_i, T) = \mbox{Scale}(v_i) \cdot \text{Threshold}_{lon}(T)\)
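Putting the base thresholds and the speed scaling together (a sketch; the lookup table mirrors the values above):

```python
# Base thresholds in meters, keyed by measurement time T in seconds.
BASE_THRESHOLDS = {3: (1.0, 2.0), 5: (1.8, 3.6), 8: (3.0, 6.0)}  # (lat, lon)

def scale(v):
    """Piecewise-linear scaling by initial speed v in m/s."""
    if v < 1.4:
        return 0.5
    if v > 11.0:
        return 1.0
    alpha = (v - 1.4) / (11.0 - 1.4)
    return 0.5 + 0.5 * alpha

def thresholds(v, T):
    """Velocity-scaled (lateral, longitudinal) thresholds at time T."""
    lat, lon = BASE_THRESHOLDS[T]
    s = scale(v)
    return s * lat, s * lon
```

For example, a stationary agent evaluated at T = 3 seconds gets half the base thresholds, while a fast agent at T = 8 seconds keeps the full (3 m, 6 m).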

Overlap Rate

The overlap rate is computed from the highest-confidence joint prediction of each multi-modal joint prediction. If any of the M agents' predicted trajectories overlap, at any time step up to T, with any other object that was visible at the prediction time step, or with any of the other jointly predicted trajectories, a single overlap is counted. The overlap rate for this challenge is computed as the total number of overlaps divided by the total number of multi-modal joint predictions.
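A rough sketch of the counting logic. The official metric checks overlap between the objects' bounding boxes; here each object is approximated by a disc of fixed radius, which is purely our simplification:

```python
import numpy as np

def overlap_count(top_pred, other_trajs, radius=1.0):
    """1 if the top-confidence joint prediction overlaps anything, else 0.

    top_pred: (M, T, 2) trajectories of the highest-confidence joint mode;
    other_trajs: (N, T, 2) trajectories of other objects visible at the
    prediction time step. Disc approximation with a fixed radius.
    """
    M = top_pred.shape[0]
    # Predicted agents vs. other visible objects, at every time step.
    for j in range(M):
        d = np.linalg.norm(top_pred[j][None] - other_trajs, axis=-1)  # (N, T)
        if (d < 2 * radius).any():
            return 1
    # Jointly predicted agents vs. each other.
    for j in range(M):
        for k in range(j + 1, M):
            d = np.linalg.norm(top_pred[j] - top_pred[k], axis=-1)
            if (d < 2 * radius).any():
                return 1
    return 0
```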

mAP

The first step to computing the mAP metric is determining a trajectory bucket for the ground truth of the first object to be predicted (the selection is arbitrary). The buckets include straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary. For each bucket, the following is computed.

Using the same definition of a miss as defined above, any joint prediction classified as a miss is assigned a false positive, and any that is not a miss is assigned a true positive. Consistent with object detection mAP metrics, only one true positive is allowed per multi-modal joint prediction; it is assigned to the highest-confidence qualifying prediction, and all other predictions in the multi-modal joint prediction are assigned a false positive. True positives and false positives are stored along with their confidences in a list per bucket. To compute the metric, the bucket entries are sorted by confidence and a precision/recall (P/R) curve is computed.

Consider a simple example of joint predictions on two agents, where the white arrows are ground truth trajectories, the colored arrows are predicted trajectories with confidence scores, and trajectories of the same color are paired. For object 1 and object 2, only the blue trajectory is within the given lateral and longitudinal thresholds of the ground truth. The precision and recall, computed by sorting the confidence scores, are:

    Rank (confidence score)    Precision    Recall
    0.6                        0%           0%
    0.5                        50%          100%
    0.2                        33.3%        100%
    0.1                        25%          100%

While models can produce probabilities over the predicted trajectories, for the purpose of evaluation (and in this example) only the scores' relative ranking matters; the scores are not required to sum to 1.
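The table above can be reproduced with a small helper (ours, for illustration): sort by descending confidence, then take cumulative true positives over rank.

```python
import numpy as np

def precision_recall(scores, is_tp, num_gt):
    """Cumulative precision and recall after sorting by descending confidence."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    ranks = np.arange(1, len(tp) + 1)
    return cum_tp / ranks, cum_tp / num_gt

# The example: 0.6 is a miss, 0.5 is the true positive, 0.2 and 0.1 are misses.
precision, recall = precision_recall([0.6, 0.5, 0.2, 0.1], [0, 1, 0, 0], num_gt=1)
```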

The mAP metric is computed using the interpolated precision values as described in "The PASCAL Visual Object Classes (VOC) Challenge" (Everingham, 2009, p. 11) but uses the newer method that includes all samples in the computation consistent with the current PASCAL challenge metrics.

After an mAP metric has been computed for all buckets, an average across all buckets is computed as the overall mAP metric.
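The per-bucket AP can be computed from the sorted P/R points with the all-points interpolation described above; this sketch follows the standard method, though the official implementation may differ in detail:

```python
import numpy as np

def average_precision(precision, recall):
    """All-points interpolated AP from P/R samples sorted by confidence."""
    r = np.concatenate([[0.0], np.asarray(recall, dtype=float), [1.0]])
    p = np.concatenate([[0.0], np.asarray(precision, dtype=float), [0.0]])
    # Make precision monotonically non-increasing from the right
    # (the "interpolated" precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area of the step function where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

For the example table above, this yields an AP of 0.5: the single true positive sits at rank 2, so the interpolated precision at full recall is 50%.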

Rules Regarding Awards

See the rules on the Challenges Overview page.