This section explains how the perception dataset is labeled and how the data is formatted.
The perception dataset contains independently-generated labels for lidar and camera data, not simply projections.
We provide 3D bounding box labels in lidar data. The lidar labels are 3D 7-DOF bounding boxes in the vehicle frame with globally unique tracking IDs.
The following objects have 3D labels: vehicles, pedestrians, cyclists, signs.
The bounding boxes have zero pitch and zero roll. Heading is the angle (in radians, normalized to [-π, π]) needed to rotate the vehicle frame +X axis about the Z axis to align with the object's forward axis.
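As a concrete illustration, the sketch below converts a 7-DOF box to its eight corners in the vehicle frame. Because pitch and roll are zero, only a yaw rotation by the heading is needed; the function and argument names are illustrative, not the dataset's utility code.

```python
import numpy as np

def box_corners(cx, cy, cz, length, width, height, heading):
    """Return the 8 corners (8 x 3) of a 7-DOF box in the vehicle frame."""
    # Corner offsets in the box's own frame (+x forward, +y left, +z up).
    x, y, z = length / 2.0, width / 2.0, height / 2.0
    corners = np.array([
        [ x,  y, -z], [ x, -y, -z], [-x, -y, -z], [-x,  y, -z],
        [ x,  y,  z], [ x, -y,  z], [-x, -y,  z], [-x,  y,  z],
    ])
    # Rotate about +Z by the heading angle, then translate to the center.
    c, s = np.cos(heading), np.sin(heading)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ rot_z.T + np.array([cx, cy, cz])
```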
Each scene may include an area that is not labeled, which is called a “No Label Zone” (NLZ). These capture areas such as the opposite side of a highway. See our label specifications document for details. NLZs are represented as polygons in the global frame. These polygons are not necessarily convex. In addition to these polygons, each lidar point is annotated with a boolean to indicate whether it is in an NLZ or not.
Our metrics computation code requires the user to indicate whether each prediction overlaps an NLZ. Users can derive this information by checking whether the prediction overlaps any NLZ-annotated lidar points (on both the 1st and 2nd returns), as sketched below.
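A minimal sketch of that check, assuming the prediction is a 7-DOF box in the vehicle frame and the NLZ-flagged points have already been gathered from both returns (function name and argument layout are illustrative):

```python
import numpy as np

def prediction_overlaps_nlz(box, nlz_points):
    """Check whether any NLZ-flagged lidar point falls inside a predicted box.

    `box` is (cx, cy, cz, length, width, height, heading) in the vehicle
    frame; `nlz_points` is an (N, 3) array of points (from both returns)
    whose NLZ boolean is set.
    """
    cx, cy, cz, length, width, height, heading = box
    # Express the points in the box frame: shift to the box center,
    # then rotate by -heading about +Z.
    shifted = nlz_points - np.array([cx, cy, cz])
    c, s = np.cos(-heading), np.sin(-heading)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = shifted @ rot_z.T
    # Inside test against the half-dimensions along each axis.
    inside = ((np.abs(local[:, 0]) <= length / 2.0)
              & (np.abs(local[:, 1]) <= width / 2.0)
              & (np.abs(local[:, 2]) <= height / 2.0))
    return bool(inside.any())
```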
We provide 2D bounding box labels in the camera images. The camera labels are tight-fitting, axis-aligned 2D bounding boxes with globally unique tracking IDs. The bounding boxes cover only the visible parts of the objects.
The following objects have 2D labels: vehicles, pedestrians, cyclists. We do not provide object track correspondences across cameras.
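Because the 2D labels are axis-aligned, overlap computations reduce to simple interval arithmetic. A minimal sketch, assuming boxes are given as a center plus a size along each image axis (the field order is illustrative):

```python
def iou_2d(box_a, box_b):
    """IoU of two axis-aligned 2D boxes given as (cx, cy, size_x, size_y)."""
    def to_edges(box):
        cx, cy, sx, sy = box
        return cx - sx / 2, cy - sy / 2, cx + sx / 2, cy + sy / 2

    ax0, ay0, ax1, ay1 = to_edges(box_a)
    bx0, by0, bx1, by1 = to_edges(box_b)
    # Intersection rectangle; clamp to zero when the boxes do not overlap.
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0
```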
This section explains the coordinate systems, as well as the format of the lidar and camera data.
See the data format proto for additional details.
We use the following coordinate systems in the dataset.
The origin of the global frame is set to the vehicle position when the vehicle starts. It is an ‘East-North-Up’ coordinate frame: ‘East’ (x) points directly east along the line of latitude, ‘North’ (y) points towards the north pole, and ‘Up’ (z) is aligned with the gravity vector, positive upwards.
In the vehicle frame, the x-axis is positive forwards, the y-axis is positive to the left, and the z-axis is positive upwards. A vehicle pose defines the transform from the vehicle frame to the global frame.
Each sensor comes with an extrinsic transform that defines the transform from the sensor frame to the vehicle frame.
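Putting the two transforms together, a point observed in a sensor frame can be mapped to the global frame by composing the extrinsic with the vehicle pose. A minimal sketch, assuming both transforms are 4x4 homogeneous matrices:

```python
import numpy as np

def sensor_to_global(points_sensor, extrinsic, vehicle_pose):
    """Map (N, 3) points from a sensor frame into the global frame.

    `extrinsic` is the 4x4 sensor-to-vehicle transform and `vehicle_pose`
    the 4x4 vehicle-to-global transform, so the composition
    vehicle_pose @ extrinsic maps sensor -> global.
    """
    n = points_sensor.shape[0]
    homogeneous = np.hstack([points_sensor, np.ones((n, 1))])  # (N, 4)
    transform = vehicle_pose @ extrinsic
    return (homogeneous @ transform.T)[:, :3]
```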
The camera frame is placed at the center of the camera lens. The x-axis points down the lens barrel, out of the lens. The z-axis points up. The y/z plane is parallel to the image plane. The coordinate system is right-handed.
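Note that this differs from the convention common in computer-vision libraries, where +z points out of the lens, +x right, and +y down. A sketch of the change of basis (the target convention is an assumption for illustration, not part of the dataset):

```python
import numpy as np

# Rotation from this dataset's camera frame (+x out of the lens, +y left,
# +z up) to a common computer-vision camera frame (+z out of the lens,
# +x right, +y down). Row i gives the CV axis in dataset-frame coordinates.
DATASET_CAM_TO_CV = np.array([
    [0.0, -1.0,  0.0],   # CV +x (right)   = dataset -y
    [0.0,  0.0, -1.0],   # CV +y (down)    = dataset -z
    [1.0,  0.0,  0.0],   # CV +z (forward) = dataset +x
])

def to_cv_camera_frame(points_cam):
    """Re-express (N, 3) points from the dataset camera frame in CV axes."""
    return points_cam @ DATASET_CAM_TO_CV.T
```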
The lidar sensor frame has the z-axis pointing upward with the x/y plane depending on the lidar position.
The lidar spherical coordinate system is based on the Cartesian coordinate system in lidar sensor frame. A point (x, y, z) in lidar Cartesian coordinates can be uniquely translated to a (range, azimuth, inclination) tuple in lidar spherical coordinates.
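A sketch of that conversion, assuming the usual convention of azimuth measured in the x/y plane from +x toward +y and inclination measured upward from the x/y plane:

```python
import numpy as np

def cartesian_to_spherical(x, y, z):
    """Convert lidar-frame Cartesian coordinates to (range, azimuth, inclination)."""
    rng = np.sqrt(x * x + y * y + z * z)          # distance from the origin
    azimuth = np.arctan2(y, x)                    # angle in the x/y plane
    inclination = np.arctan2(z, np.sqrt(x * x + y * y))  # angle above it
    return rng, azimuth, inclination
```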
The dataset contains data from five lidars: one mid-range lidar (top) and four short-range lidars (front, side left, side right, and rear).
For the purposes of this dataset, the following limitations were applied to the lidar data:
- The range of the mid-range lidar is truncated to a maximum of 75 meters.
- The range of the short-range lidars is truncated to a maximum of 20 meters.
- The two strongest returns are provided for all five lidars.
An extrinsic calibration matrix transforms the lidar frame to the vehicle frame. The mid-range lidar has a non-uniform inclination beam angle pattern. A 1D tensor is available to get the exact inclination of each beam.
The point cloud of each lidar is encoded as a range image. Two range images are provided for each lidar, one for each of the two strongest returns. Each range image has 4 channels:
- channel 0: range (the distance from the lidar origin to the return)
- channel 1: lidar intensity
- channel 2: lidar elongation
- channel 3: is_in_nlz (1 = in an NLZ, -1 = not in an NLZ)
Lidar elongation refers to the elongation of the pulse beyond its nominal width. Returns with long pulse elongation, for example, indicate that the laser reflection is potentially smeared or refracted, such that the return pulse is elongated in time.
In addition to the basic 4 channels, we also provide another 6 channels for lidar-to-camera projection. The projection method used takes the rolling shutter effect into account:
- channel 0: camera name of the first projection
- channel 1: x (axis along image width)
- channel 2: y (axis along image height)
- channel 3: camera name of the second projection (set to UNKNOWN if the point does not project to a second camera)
- channel 4: x (axis along image width)
- channel 5: y (axis along image height)
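For instance, given the channel layout above, the image coordinates of all returns that project into a particular camera can be gathered like this (a minimal sketch; the array layout mirrors the list above and `camera_name` is the integer camera enum value):

```python
import numpy as np

def pixels_for_camera(projection_image, camera_name):
    """Collect (x, y) image coordinates of all lidar returns that project
    into `camera_name`, checking both the first and second projection.

    `projection_image` is an (H, W, 6) array laid out as in the list above.
    """
    first = projection_image[..., 0] == camera_name
    second = projection_image[..., 3] == camera_name
    xy_first = projection_image[first][:, 1:3]
    xy_second = projection_image[second][:, 4:6]
    return np.concatenate([xy_first, xy_second], axis=0)
```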
A range image represents a lidar point cloud in the spherical coordinate system based on the following rules:
- Each row of the range image corresponds to an inclination; row 0 (the top of the image) corresponds to the maximum inclination.
- Each column corresponds to an azimuth; column 0 (the left of the image) corresponds to the -x axis (i.e. the opposite of the forward direction), and the center of the image corresponds to the +x axis (i.e. the forward direction). Note that an azimuth correction is needed to make sure the center of the range image corresponds to the +x axis.
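Combining these rules with the per-beam inclination tensor and the lidar extrinsic mentioned earlier yields a point cloud in the vehicle frame. A minimal sketch that ignores the azimuth correction and rolling-shutter details for brevity (function and argument names are illustrative):

```python
import numpy as np

def range_image_to_points(range_image, inclinations, extrinsic):
    """Convert an (H, W) range channel to an (N, 3) point cloud in the
    vehicle frame, following the row/column rules above.

    `inclinations` is the (H,) per-beam inclination tensor, ordered so
    that row 0 is the maximum inclination; `extrinsic` is the 4x4
    lidar-to-vehicle transform.
    """
    height, width = range_image.shape
    # Column 0 maps to azimuth ~pi (the -x direction) and the image
    # center to azimuth 0 (the +x direction), decreasing left to right.
    cols = np.arange(width, dtype=np.float64)
    azimuth = np.pi - (cols + 0.5) * (2.0 * np.pi / width)
    az, inc = np.meshgrid(azimuth, inclinations)   # both (H, W)
    # Spherical -> Cartesian in the lidar frame.
    x = range_image * np.cos(inc) * np.cos(az)
    y = range_image * np.cos(inc) * np.sin(az)
    z = range_image * np.sin(inc)
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[range_image.reshape(-1) > 0]   # drop empty returns
    # Lidar frame -> vehicle frame via the extrinsic.
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ extrinsic.T)[:, :3]
```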
The dataset contains images from five cameras facing five different directions: front, front left, front right, side left, and side right.
One camera image is provided for each pair of camera and frame.