3D Camera-Only Detection
Overview
Given one or more images from the multiple cameras, produce a set of upright 3D bounding boxes for the visible objects in the scene. This is a variant of the original 3D Detection Challenge without the availability of LiDAR sensing. The current dataset release serves as the training and validation sets. We provide a new test set for this challenge that contains camera data but no LiDAR data. Following the original 3D Detection Challenge, the ground truth bounding boxes are limited to a maximum range of 75 meters.
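As a rough illustration of the expected output, the Python sketch below defines a minimal upright 3D box and applies the 75 meter range cap. The class and field names are illustrative assumptions only and do not correspond to the official submission format; whether the range cap is measured from the box center is also an assumption here.

import dataclasses
import math

@dataclasses.dataclass
class UprightBox3D:
    """Hypothetical container for one upright 3D box in the vehicle frame."""
    center_x: float  # meters
    center_y: float  # meters
    center_z: float  # meters
    length: float    # meters
    width: float     # meters
    height: float    # meters
    heading: float   # rotation around the vertical axis, radians

MAX_RANGE_M = 75.0

def within_evaluated_range(box: UprightBox3D, max_range_m: float = MAX_RANGE_M) -> bool:
    """Keeps only boxes whose center lies within the evaluated range (assumed definition)."""
    return math.hypot(box.center_x, box.center_y) <= max_range_m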
The cameras mounted on our autonomous vehicles trigger sequentially in a clockwise order, where each camera rapidly scans the scene sideways using a rolling shutter. To help users reduce the effects of this rolling shutter timing, we include additional 3D bounding boxes in the dataset that are synchronized with the cameras. Specifically, for each object in the scene, we provide a bounding box for the time at which the center of the object is captured by the camera that best perceives the object. To this end, we solve an optimization problem that takes into account the rolling shutter, the motion of the autonomous vehicle, and the motion of the object. For each object, we store the adjusted bounding box in the field camera_synced_box and indicate the corresponding camera in the field most_visible_camera_name. To reduce the burden of the rolling shutter camera setup on users, our evaluation server compares the predicted bounding boxes with the ground truth bounding boxes that are synchronized with the rolling shutter cameras. We therefore advise users to train their models on the bounding boxes stored in the field camera_synced_box and then produce bounding boxes without taking the rolling shutter into further consideration. We believe that this is a user-friendly approximation of the rolling shutter. Furthermore, in our sensor setup, even when an object is detected in adjacent cameras, the detections will have similar shutter capture times and are likely to match with the same 3D ground truth bounding box. For completeness, we note that more complex alternative solutions that explicitly consider the rolling shutter camera geometry are also possible.
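The sketch below shows one way to extract these camera-synchronized boxes as training targets. It assumes the waymo_open_dataset Python package and that camera_synced_box and most_visible_camera_name are exposed on the laser labels as in recent dataset releases; exact proto layouts may differ between releases.

import tensorflow as tf
from waymo_open_dataset import dataset_pb2

def camera_synced_targets(tfrecord_path: str):
    """Yields (most visible camera name, camera-synchronized box) per labeled object."""
    dataset = tf.data.TFRecordDataset(tfrecord_path, compression_type='')
    for record in dataset:
        frame = dataset_pb2.Frame()
        frame.ParseFromString(record.numpy())
        for label in frame.laser_labels:
            # Train on the rolling-shutter-adjusted box rather than label.box.
            if label.HasField('camera_synced_box'):
                yield label.most_visible_camera_name, label.camera_synced_box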
Some objects that are collectively visible to the LiDAR sensors may not be visible to the cameras. To help users filter such fully occluded objects, we provide a new field num_top_lidar_points_in_box, which is similar to the existing field num_lidar_points_in_box, but only accounts for points from our top LiDAR. Since the cameras are collocated with the top LiDAR, objects are likely occluded if no top LiDAR points are contained in their ground truth 3D bounding box. Our evaluation server uses the same heuristic to ignore fully occluded objects.
For training, users may use the available camera and LiDAR data. For inference, only camera data is available; no LiDAR data is provided. Furthermore, when computing predictions for a frame, users may only use sensor data from that frame and all previous frames; sensor data from subsequent frames must not be used. The leaderboard ranks submissions according to a new metric, LET-3D-APL. See our paper on the metric for more details.
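The sketch below illustrates the causal constraint with a simple rolling buffer of past camera frames; the detector interface and buffer size are hypothetical, not part of the official tooling.

from collections import deque

def run_causal_inference(camera_frames, detector, history_size: int = 5):
    """Runs the detector frame by frame, exposing only the current and past frames."""
    history = deque(maxlen=history_size)  # past camera frames, most recent last
    predictions = []
    for frame in camera_frames:
        history.append(frame)
        # The detector sees only the buffered past frames plus the current one,
        # never any subsequent frame.
        predictions.append(detector(list(history)))
    return predictions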
Leaderboard