Vision-based End-to-End Driving

Various approaches to developing autonomous driving systems exist, including the modular approach, which typically splits the system into distinct components for, e.g., perception, prediction, and planning, and the end-to-end approach, which learns to map raw sensor data directly to driving actions. The latter is especially attractive to explore given the advancements in Large Language Models, which bring world knowledge and reasoning capabilities.
The WOD Vision-based End-to-End Driving Challenge aims to foster research around long-tail scenarios using the end-to-end modeling approach. Unlike existing datasets, this challenge introduces a novel dataset comprising 5000 driving segments specifically curated to represent long-tail scenarios that drivers encounter in different environments, such as navigating construction zones during marathons, avoiding pedestrians falling off scooters, and maneuvering around unexpected obstacles on freeways. A mining analysis indicates that these events occur with a frequency of less than 0.003% in daily driving, highlighting the dataset's unique focus on rarer and more interesting scenarios. This challenge provides a unique platform for the research community to train and, more importantly, rigorously test the robustness and generalization ability of end-to-end approaches on these infrequent but impactful scenarios.
Participants in this challenge are tasked with predicting the future 5-second waypoints for their autonomous driving agent in bird's-eye-view coordinates, given camera images captured before a particular driving moment, the historical pose of the agent, and routing information. This setup allows for evaluating the agent's ability to make appropriate driving decisions in the face of irregular events.
Submissions will be evaluated using the rater feedback metric (with the ADE metric as a tie-breaker), both designed to assess the quality of the predicted waypoints in these complex scenarios.
The dataset for this challenge consists of 5000 run segments, where each individual segment has a duration of 20 seconds. 2000 segments are designated for training, with the full 20-second driving sequence and pose information provided; participants are also free to leverage any public datasets to train their models in addition to the provided training set. The remaining 3000 segments are for evaluation. For the evaluation set, participants will receive only the first 12 seconds of data leading up to a particular moment, while the subsequent 8 seconds are reserved for evaluating the predicted waypoints. The input camera data consist of 8 camera images providing a 360-degree view around the driving agent.
This new challenge and associated long-tail dataset offer a valuable benchmark for the next generation of end-to-end autonomous driving agents, pushing the research community’s ability to assess robustness and generalization in the face of real-world driving complexities.
Leaderboard
Tutorial
Check this colab and follow the provided tutorial to get familiar with the challenge.
Submit
Note: Challenge submissions are not yet open. We plan to accept submissions soon, so stay tuned for an announcement.
Participants must submit their predictions as serialized E2EDChallengeSubmission protocol buffers. Each FrameTrajectoryPredictions proto within the submission represents a single run segment prediction from the test set. These protos contain the predicted future trajectory of the ego vehicle. Specifically, each prediction must forecast the ego vehicle's X and Y positions for the next 5 seconds, sampled at 4 Hz, resulting in a (20, 2) shaped trajectory. Important: the first prediction point must correspond to 0.25 seconds into the future; the current time step should not be included.
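To make the timing and shape requirements concrete, here is a minimal sketch (in plain NumPy, not the official submission code) of the arithmetic behind a single prediction:

```python
import numpy as np

# 5 seconds at 4 Hz -> 20 future waypoints. The first waypoint is 0.25 s
# into the future; the current time step (t = 0) is NOT included.
NUM_STEPS, DT = 20, 0.25
timestamps = DT * np.arange(1, NUM_STEPS + 1)  # 0.25, 0.50, ..., 5.00 s

# Placeholder output from a hypothetical model: future (x, y) positions of
# the ego vehicle, which must form a (20, 2) array.
trajectory = np.zeros((NUM_STEPS, 2), dtype=np.float32)
assert trajectory.shape == (20, 2)
assert np.isclose(timestamps[0], 0.25) and np.isclose(timestamps[-1], 5.0)
```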
To submit your entry to the leaderboard, upload your file as a serialized E2EDChallengeSubmission proto file compressed into a .tar.gz archive file. If the single proto is too large, you can shard the predictions across multiple files, where each file contains a subset of the predictions. Then tar and gzip them into a .tar.gz file before uploading.
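For example, the packaging step can be done with Python's standard tarfile module (a sketch; the shard file names below are placeholders, not a required naming scheme):

```python
import tarfile

# Placeholder shard names: each file holds a serialized E2EDChallengeSubmission
# proto covering a subset of the test-set predictions.
shards = [
    "my_submission.binproto-00000-of-00002",
    "my_submission.binproto-00001-of-00002",
]

# "w:gz" writes a gzip-compressed tar archive in a single step.
with tarfile.open("my_submission.tar.gz", "w:gz") as archive:
    for shard in shards:
        archive.add(shard, arcname=shard)
```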
To be eligible to participate in the Challenge, each individual/all team members must read and agree to be bound by the WOD Challenges Official Rules. You can only submit against the Test Set 3 times every 30 days. (Submissions that error out do not count against this total.)
Metrics
Leaderboard ranking for this challenge is based on the Rater Feedback Score (see definition below) at 3 and 5 seconds, averaged over 11 different types of scenarios. The standard Average Distance Error (L2 error, see tutorial) between the highest rater-scored trajectory and the predicted trajectory is used as the secondary metric.
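For reference, a minimal sketch of the secondary ADE metric, assuming the predicted and reference trajectories are (T, 2) arrays of (x, y) waypoints sampled on matching timestamps:

```python
import numpy as np

def average_displacement_error(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean L2 distance between predicted and reference (x, y) waypoints."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())
```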
Rater Feedback Score

As described on the WOD-E2E dataset page, we annotate 3 rater-specified trajectories, each with a rater's score in [0, 10], where 10 is great driving and 0 is the opposite. Given a model-predicted trajectory, we match the prediction with the closest rater-specified trajectory, similar to the Miss Rate metric described in the Motion Prediction and Interaction Prediction Challenges.
A trust region is defined as the region within a given lateral and longitudinal threshold of a rater-specified trajectory at a given time T (T = 3, 5 s). If a predicted trajectory falls within a trust region, it is given the rater's score of the corresponding rater-specified trajectory (the closest one in trust-region-adjusted distance).
The following longitudinal and lateral thresholds are used to define the trust regions:
Longitudinal and lateral base thresholds. Note that the longitudinal threshold is 4 times larger than in the original Miss Rate metric to account for camera depth uncertainty.
| Horizon | Lateral threshold \(\tilde{\tau}_{\mathrm{lat}}\) (m) | Longitudinal threshold \(\tilde{\tau}_{\mathrm{lng}}\) (m) |
|---|---|---|
| \(T=3\,\mathrm{s}\) | \(1.0\) | \(4.0\) |
| \(T=5\,\mathrm{s}\) | \(1.8\) | \(7.2\) |
Scale by the initial speed. The thresholds are scaled according to the initial speed of the rater-specified trajectory. The scaling function is a piecewise linear function of the initial speed \(v\) (m/s):
$$
\mathrm{scale}(v) = \begin{cases} 0.5, & v < 1.4\,\mathrm{m/s},\\
0.5 + 0.5 \times \dfrac{v - 1.4}{11 - 1.4}, & 1.4\,\mathrm{m/s} \le v < 11\,\mathrm{m/s},\\
1, & v \ge 11\,\mathrm{m/s}. \end{cases}
$$

Then, the final thresholds are determined by
$$
\tau_{\mathrm{lat}}(T, v) := \mathrm{scale}(v) \times \tilde{\tau}_{\mathrm{lat}}(T),
$$
$$
\tau_{\mathrm{lng}}(T, v) := \mathrm{scale}(v) \times \tilde{\tau}_{\mathrm{lng}}(T).
$$
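A direct transcription of the scaling and final-threshold formulas into Python (a sketch for illustration, not the official evaluation code):

```python
def threshold_scale(v: float) -> float:
    """Piecewise-linear scaling by the initial speed v (m/s) of the rater-specified trajectory."""
    if v < 1.4:
        return 0.5
    if v < 11.0:
        return 0.5 + 0.5 * (v - 1.4) / (11.0 - 1.4)
    return 1.0

# Base thresholds in meters from the table above, keyed by horizon T in seconds.
BASE_TAU_LAT = {3: 1.0, 5: 1.8}
BASE_TAU_LNG = {3: 4.0, 5: 7.2}

def trust_region_thresholds(T: int, v: float) -> tuple[float, float]:
    """Final (lateral, longitudinal) thresholds tau_lat(T, v) and tau_lng(T, v)."""
    s = threshold_scale(v)
    return s * BASE_TAU_LAT[T], s * BASE_TAU_LNG[T]
```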
If a predicted trajectory falls outside all trust regions, we assign it a score exponentially lower than the score \(\bar{s}\) of the closest rater-specified trajectory, depending on the longitudinal and lateral distance errors \(\Delta_{\mathrm{lng}}\) and \(\Delta_{\mathrm{lat}}\). Specifically,
$$
\mathrm{rater\_feedback\_score} = \bar{s} \times 0.1 ^ {\max \left( \max \left[ \frac{\Delta_{\mathrm{lng}}}{\tau_\mathrm{lng}}, \frac{\Delta_{\mathrm{lat}}}{\tau_\mathrm{lat}} \right] - 1, \ 0 \right)}
$$
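Putting the pieces together, a sketch of the per-horizon score computation (using the threshold helper sketched above; the official implementation in the tutorial colab is authoritative):

```python
def rater_feedback_score(delta_lng: float, delta_lat: float,
                         tau_lng: float, tau_lat: float,
                         rater_score: float) -> float:
    """Matched rater's score, decayed by a factor of 10 for each threshold-multiple
    by which the prediction exceeds the trust region (no penalty inside it)."""
    overshoot = max(max(delta_lng / tau_lng, delta_lat / tau_lat) - 1.0, 0.0)
    return rater_score * 0.1 ** overshoot
```

Inside the trust region the overshoot term is zero, so the prediction inherits the rater's score unchanged; at twice the threshold the score is reduced by a factor of 10.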