Vision-based End-to-End Driving

Various approaches to developing autonomous driving systems exist, including the modular approach, which typically splits the system into distinct components for, e.g., perception, prediction, and planning, and the end-to-end approach, which learns to map raw sensor data directly to driving actions. The latter is especially attractive to explore given the advancements in Large Language Models, which bring world knowledge and reasoning capabilities.
The WOD Vision-based End-to-End Driving Challenge aims to foster research around long-tail scenarios using the end-to-end modeling approach. Unlike existing datasets, this challenge introduces a novel dataset comprising 5000 driving segments specifically curated to represent long-tail scenarios that drivers encounter in different environments, such as navigating construction zones during marathons, avoiding pedestrians falling off scooters, and maneuvering around unexpected obstacles on freeways. A mining analysis indicates that these events occur with a frequency of less than 0.003% in daily driving, highlighting the dataset's unique focus on rarer and more interesting scenarios. This challenge provides a unique platform for the research community to train and, more importantly, rigorously test the robustness and generalization ability of end-to-end approaches on these infrequent but impactful scenarios.
Participants in this challenge are tasked with predicting the future 5-second waypoints for their autonomous driving agent in bird's-eye-view coordinates, given camera images captured before a particular driving moment, the historical pose of the agent, and routing information. This setup allows for evaluating the agent's ability to make appropriate driving decisions in the face of irregular events.
Submissions will be evaluated using the rater feedback metric (with the ADE metric as a tie-breaker), both designed to assess the quality of the predicted waypoints in these complex scenarios.
The dataset for this challenge consists of 5000 run segments, where each individual segment has a duration of 20 seconds. 2000 segments are designated for training, with the full 20-second driving sequence and pose information provided; participants are also free to leverage any public datasets to train their models in addition to the provided training set. The remaining 3000 segments are for evaluation. For the evaluation set, participants will receive only the first 12 seconds of data leading up to a particular moment, while the subsequent 8 seconds are reserved for evaluating the predicted waypoints. The input camera data consist of 8 camera images providing a 360-degree view around the driving agent.
This new challenge and associated long-tail dataset offer a valuable benchmark for the next generation of end-to-end autonomous driving agents, pushing the research community’s ability to assess robustness and generalization in the face of real-world driving complexities.
Leaderboard
Tutorial
Check this colab and follow the provided tutorial to get familiar with the challenge.
Submit
Note: Challenge submissions are not yet open. We plan to accept submissions soon, so stay tuned for an announcement.
Participants must submit their predictions as serialized E2EDChallengeSubmission protocol buffers. Each FrameTrajectoryPredictions proto within the submission represents a single run segment prediction from the test set. These protos contain the predicted future trajectory of the ego vehicle. Specifically, each prediction must forecast the ego vehicle's X and Y positions for the next 5 seconds, sampled at 4 Hz, resulting in a (20, 2) shaped trajectory. Important: the first prediction point must correspond to 0.25 seconds into the future; the current time step should not be included.
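To make the timing and shape requirements concrete, here is a minimal sketch (in plain NumPy, not the official submission code) of the arithmetic behind a single prediction:

```python
import numpy as np

# 5 seconds at 4 Hz -> 20 future waypoints. The first waypoint is 0.25 s
# into the future; the current time step (t = 0) is NOT included.
NUM_STEPS, DT = 20, 0.25
timestamps = DT * np.arange(1, NUM_STEPS + 1)  # 0.25, 0.50, ..., 5.00 s

# Placeholder output from a hypothetical model: future (x, y) positions of
# the ego vehicle, which must form a (20, 2) array.
trajectory = np.zeros((NUM_STEPS, 2), dtype=np.float32)
assert trajectory.shape == (20, 2)
assert np.isclose(timestamps[0], 0.25) and np.isclose(timestamps[-1], 5.0)
```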
To submit your entry to the leaderboard, upload your file as a serialized E2EDChallengeSubmission proto file compressed into a .tar.gz archive file. If the single proto is too large, you can shard the predictions across multiple files, where each file contains a subset of the predictions. Then tar and gzip them into a .tar.gz file before uploading.
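For example, the packaging step can be done with Python's standard tarfile module (a sketch; the shard file names below are placeholders, not a required naming scheme):

```python
import tarfile

# Placeholder shard names: each file holds a serialized E2EDChallengeSubmission
# proto covering a subset of the test-set predictions.
shards = [
    "my_submission.binproto-00000-of-00002",
    "my_submission.binproto-00001-of-00002",
]

# "w:gz" writes a gzip-compressed tar archive in a single step.
with tarfile.open("my_submission.tar.gz", "w:gz") as archive:
    for shard in shards:
        archive.add(shard, arcname=shard)
```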
To be eligible to participate in the Challenge, each individual/all team members must read and agree to be bound by the WOD Challenges Official Rules. You can only submit against the Test Set 3 times every 30 days. (Submissions that error out do not count against this total.)
Metrics
Leaderboard ranking for this challenge is based on the Rater Feedback Score (see definition below) at 3 and 5 seconds, averaged over 11 different types of scenarios. The standard Average Distance Error (L2 error, see tutorial) between the highest rater-scored trajectory and the predicted trajectory is used as the secondary metric.
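For reference, a minimal sketch of the secondary ADE metric, assuming the predicted and reference trajectories are (T, 2) arrays of (x, y) waypoints sampled on matching timestamps:

```python
import numpy as np

def average_displacement_error(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean L2 distance between predicted and reference (x, y) waypoints."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())
```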
Rater Feedback Score

As described on the WOD-E2E dataset page, we annotate 3 rater-specified trajectories, each with a rater's score in [0, 10], where 10 is great driving and 0 is the opposite. Given a model-predicted trajectory, we match the prediction with the closest rater-specified trajectory, similar to the Miss Rate metric described in the Motion Prediction and Interaction Prediction Challenges.
A trust region is defined as the region within a given lateral and longitudinal threshold of a rater-specified trajectory at a given time T (T = 3, 5 s). If a predicted trajectory falls within a trust region, it is given the rater's score of the corresponding rater-specified trajectory (the closest one in trust-region-adjusted distance).
The following longitudinal and lateral thresholds are used to define the trust regions:
Longitudinal and lateral base thresholds. Note that the longitudinal threshold is 4 times larger than in the original Miss Rate metric to account for camera depth uncertainty.
| Horizon | Lateral threshold \(\tilde{\tau}_{\mathrm{lat}}\) (m) | Longitudinal threshold \(\tilde{\tau}_{\mathrm{lng}}\) (m) |
|---|---|---|
| \(T=3\,\mathrm{s}\) | \(1.0\) | \(4.0\) |
| \(T=5\,\mathrm{s}\) | \(1.8\) | \(7.2\) |
Scale by the initial speed. The thresholds are scaled according to the initial speed of the rater-specified trajectory. The scaling function is a piecewise linear function of the initial speed \(v\) (m/s):
$$
\mathrm{scale}(v) = \begin{cases} 0.5, & v < 1.4\,\mathrm{m/s},\\
0.5 + 0.5 \times \dfrac{v - 1.4}{11 - 1.4}, & 1.4\,\mathrm{m/s} \le v < 11\,\mathrm{m/s},\\
1, & v \ge 11\,\mathrm{m/s}. \end{cases}
$$

Then, the final thresholds are determined by
$$
\tau_{\mathrm{lat}}(T, v) := \mathrm{scale}(v) \times \tilde{\tau}_{\mathrm{lat}}(T),
$$
$$
\tau_{\mathrm{lng}}(T, v) := \mathrm{scale}(v) \times \tilde{\tau}_{\mathrm{lng}}(T).
$$
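A direct transcription of the scaling and final-threshold formulas into Python (a sketch for illustration, not the official evaluation code):

```python
def threshold_scale(v: float) -> float:
    """Piecewise-linear scaling by the initial speed v (m/s) of the rater-specified trajectory."""
    if v < 1.4:
        return 0.5
    if v < 11.0:
        return 0.5 + 0.5 * (v - 1.4) / (11.0 - 1.4)
    return 1.0

# Base thresholds in meters from the table above, keyed by horizon T in seconds.
BASE_TAU_LAT = {3: 1.0, 5: 1.8}
BASE_TAU_LNG = {3: 4.0, 5: 7.2}

def trust_region_thresholds(T: int, v: float) -> tuple[float, float]:
    """Final (lateral, longitudinal) thresholds tau_lat(T, v) and tau_lng(T, v)."""
    s = threshold_scale(v)
    return s * BASE_TAU_LAT[T], s * BASE_TAU_LNG[T]
```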
If a predicted trajectory falls outside all trust regions, we assign it a score exponentially lower than the score \(\bar{s}\) of the closest rater-specified trajectory, depending on the longitudinal and lateral distance errors \(\Delta_{\mathrm{lng}}\) and \(\Delta_{\mathrm{lat}}\). Specifically,
$$
\mathrm{rater\_feedback\_score} = \bar{s} \times 0.1 ^ {\max \left( \max \left[ \frac{\Delta_{\mathrm{lng}}}{\tau_\mathrm{lng}}, \frac{\Delta_{\mathrm{lat}}}{\tau_\mathrm{lat}} \right] - 1, \ 0 \right)}
$$
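Putting the pieces together, a sketch of the per-horizon score computation (using the threshold helper sketched above; the official implementation in the tutorial colab is authoritative):

```python
def rater_feedback_score(delta_lng: float, delta_lat: float,
                         tau_lng: float, tau_lat: float,
                         rater_score: float) -> float:
    """Matched rater's score, decayed by a factor of 10 for each threshold-multiple
    by which the prediction exceeds the trust region (no penalty inside it)."""
    overshoot = max(max(delta_lng / tau_lng, delta_lat / tau_lat) - 1.0, 0.0)
    return rater_score * 0.1 ** overshoot
```

Inside the trust region the overshoot term is zero, so the prediction inherits the rater's score unchanged; at twice the threshold the score is reduced by a factor of 10.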