Waymo Open Dataset Challenge

2D Video Panoptic Segmentation

Overview

Given a panoramic sequence of camera images across five cameras and over time, produce a set of panoptic segmentation labels for each pixel, where the instance labels are consistent across all images in the sequence.

This Challenge extends panoptic segmentation, which produces a set of instance and semantic labels for each pixel, to panoramic video panoptic segmentation, in which the instance labels are additionally tracked across cameras and over time.

For this Challenge, we will use the 2D Video Panoptic Segmentation Dataset, which provides panoramic video panoptic segmentation labels for a subset of the WOD Perception Dataset. For training, we provide ground truth panoptic labels with instance IDs tracked over segments of 5 temporal frames across all 5 cameras. The dataset is a subsample of the full Perception Dataset, and the list of labeled frames (represented as (context_name, timestamp) tuples) can be found in tutorial/2d_pvps_training_frames.txt.
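As an illustration, the snippet below sketches one way to iterate over a Perception Dataset TFRecord and pick out the labeled frames. It is a minimal sketch, assuming v1.4.2+ of the waymo_open_dataset Python package, an illustrative local TFRecord path, and that each line of 2d_pvps_training_frames.txt holds a comma-separated context_name,timestamp pair; see tutorial_2d_pvps.ipynb for the authoritative workflow.

```python
# Minimal sketch: path and frame-list parsing are assumptions, not the official loader.
import tensorflow as tf
from waymo_open_dataset import dataset_pb2

# Parse the (context_name, timestamp) tuples that identify labeled frames.
with open('tutorial/2d_pvps_training_frames.txt') as f:
    labeled_frames = {tuple(line.strip().split(',')) for line in f if line.strip()}

dataset = tf.data.TFRecordDataset('/path/to/segment.tfrecord', compression_type='')
for data in dataset:
    frame = dataset_pb2.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    if (frame.context.name, str(frame.timestamp_micros)) not in labeled_frames:
        continue
    for image in frame.images:
        # Labeled frames carry a CameraSegmentationLabel for each of the 5 cameras.
        label = image.camera_segmentation_label
        print(frame.context.name, frame.timestamp_micros, image.name,
              label.panoptic_label_divisor)
```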

Examples of how to use the dataset can be found in tutorial_2d_pvps.ipynb.

For this Challenge, please use at least v1.4.2 of the dataset, which adds the num_cameras_covered field to the CameraSegmentationLabel proto, and is used for the weighted Segmentation and Tracking Quality (wSTQ) metric.

Leaderboard

Note: the rankings displayed on this leaderboard may not accurately reflect the final rankings for this Challenge.

Submit

To submit your entry to the leaderboard, upload your file in the format specified in the CameraSegmentationSubmission proto. In addition to the predicted panoptic segmentation labels, users have the option of providing inference runtime information and the frame rate used at inference (0 = every frame, 1 = every other frame).

For the 2D Video Panoptic Segmentation Challenge, your submission file should be a binary file of the CameraSegmentationSubmission proto. See tutorial_2d_pvps.ipynb for an example. We also provide utility functions to generate the proto in camera_segmentation_utils.py.

Valid submissions must provide predictions for all five images in each frame specified in tutorial/2d_pvps_validation_frames.txt for the validation set and tutorial/2d_pvps_test_frames.txt for the test set.
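A minimal sketch of writing the binary submission file is shown below. The exact message fields are defined in the CameraSegmentationSubmission proto and demonstrated in the tutorial, so the import path and the population step here are assumptions rather than the official recipe.

```python
# Sketch only: consult camera_segmentation_submission.proto and tutorial_2d_pvps.ipynb
# for the authoritative fields; the import path below is an assumption.
from waymo_open_dataset.protos import camera_segmentation_submission_pb2 as submission_pb2

submission = submission_pb2.CameraSegmentationSubmission()
# ... populate the submission with panoptic predictions for every frame listed in
# tutorial/2d_pvps_validation_frames.txt (or the test list), e.g. via the helpers
# in camera_segmentation_utils.py ...

# The leaderboard expects the serialized binary proto.
with open('/tmp/my_2d_pvps_submission.bin', 'wb') as f:
    f.write(submission.SerializeToString())
```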

To be eligible to participate in the Challenge, each individual/all team members must read and agree to be bound by the WOD Challenges Official Rules.

You can only submit against the Test Set 3 times every 30 days. (Submissions that error out do not count against this total.)


Baseline

As a baseline, we provide a ViP-DeepLab model with a ResNet-50 backbone, trained on all temporal pairs of images in the training set for each camera individually. To fuse instances between cameras, we compute a panorama from the camera images and merge instances that have high IoU overlap in the panorama. This corresponds to the View model with panoptic stitching over cameras described in our paper, Waymo Open Dataset: Panoramic Video Panoptic Segmentation; please refer to the paper for implementation details.
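For intuition, here is a simplified sketch of the IoU-based merging step. It is not the released baseline code: panorama_ids_a and panorama_ids_b are hypothetical per-pixel instance-ID maps from two cameras warped into a shared panorama, with 0 marking pixels without an instance.

```python
import numpy as np

def merge_instances(panorama_ids_a, panorama_ids_b, iou_threshold=0.5):
    """Relabels instances in map B that strongly overlap an instance in map A."""
    merged = panorama_ids_b.copy()
    for id_a in np.unique(panorama_ids_a):
        if id_a == 0:
            continue
        mask_a = panorama_ids_a == id_a
        # Only instances in B that touch this A-instance can be merged with it.
        for id_b in np.unique(panorama_ids_b[mask_a]):
            if id_b == 0:
                continue
            mask_b = panorama_ids_b == id_b
            iou = np.logical_and(mask_a, mask_b).sum() / np.logical_or(mask_a, mask_b).sum()
            if iou >= iou_threshold:
                # Overlapping detections of the same object get one shared ID.
                merged[mask_b] = id_a
    return merged
```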

Note that there are discrepancies between the baseline results here and those reported in the original paper: in the paper's evaluation, instances with the is_tracked field were not masked out, resulting in lower wAQ scores. This has been resolved for the Challenge.

Metric

Leaderboard ranking for this Challenge is determined by the weighted Segmentation and Tracking Quality (wSTQ) score.

The original Segmentation and Tracking Quality (STQ) score balances segmentation and tracking performance by combining the Association Quality (AQ) for tracked classes with the Segmentation Quality (SQ) for semantic classes, and is defined as:

\begin{align}
STQ &= (AQ \times SQ)^{\frac{1}{2}}, \\
AQ &= \frac{1}{|\mathbf{G}|} \sum_{z_g \in \mathbf{G}} \frac{1}{|g_\text{id}(z_g)|} \sum_{z_f,\, z_f \cap z_g \neq \emptyset} \text{TPA}(z_f, z_g) \times \text{IoU}_\text{id}(z_f, z_g), \\
SQ &= \frac{1}{|\mathbf{C}|} \sum_{c \in \mathbf{C}} \frac{|f_\text{sem}(c) \cap g_\text{sem}(c)|}{|f_\text{sem}(c) \cup g_\text{sem}(c)|}
\end{align}

Where:

  • \(\mathbf{G}\) is the set of unique groundtruth instances across the entire sequence.

  • \(f\) and \(g\) are the prediction and groundtruth mappings, respectively.

  • \(z_f\) and \(z_g\) are predicted and groundtruth instance masks, respectively.

  • \(\mathbf{C}\) is the set of semantic classes.

  • \(\text{TPA}\) is the number of True Positive Associations for a given instance, defined as \(\text{TPA}(z_f, z_g) = |f_\text{id}(z_f) \cap g_\text{id}(z_g)|\).

  • \(\text{IoU}_\text{id}\) is the track-level IoU between a predicted and a groundtruth instance, \(\text{IoU}_\text{id}(z_f, z_g) = \frac{|f_\text{id}(z_f) \cap g_\text{id}(z_g)|}{|f_\text{id}(z_f) \cup g_\text{id}(z_g)|}\).
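As a toy illustration of how the terms combine, the sketch below computes SQ as the mean per-class IoU on made-up 1-D label arrays and folds in an assumed AQ value via the geometric mean. It is not the official metric implementation; the arrays and the AQ value are purely illustrative.

```python
# Toy numerical sketch of SQ and the final geometric mean (not the official code).
import numpy as np

pred_sem = np.array([0, 0, 1, 1, 2, 2])
gt_sem   = np.array([0, 1, 1, 1, 2, 0])

ious = []
for c in np.unique(gt_sem):
    pred_c, gt_c = pred_sem == c, gt_sem == c
    ious.append(np.logical_and(pred_c, gt_c).sum() / np.logical_or(pred_c, gt_c).sum())
sq = float(np.mean(ious))   # mean IoU over semantic classes

aq = 0.6                    # assume AQ was already computed from track overlaps
stq = (aq * sq) ** 0.5      # geometric mean of AQ and SQ
```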

The wSTQ score used for this challenge is a slight modification of STQ, where we downweight each pixel’s contribution by the number of cameras covering that pixel (i.e. the weight is 1 / num_cameras_covered). For the Waymo Open Dataset, this is in {1, 2}, and is provided as num_cameras_covered in each CameraSegmentationLabel proto.
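A minimal sketch of that weighting, assuming num_cameras_covered has already been decoded into a per-pixel array (the decoding of the CameraSegmentationLabel proto is omitted here):

```python
# Pixels seen by two cameras contribute half as much as pixels seen by one.
# `num_cameras_covered` below is a made-up decoded per-pixel array.
import numpy as np

num_cameras_covered = np.array([[1, 1, 2],
                                [1, 2, 2]], dtype=np.int32)
pixel_weights = 1.0 / num_cameras_covered  # weights in {1.0, 0.5} for this dataset
```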

Example code to call this metric can be found in wdl_limited/camera_segmentation/camera_segmentation_metrics.py and in the pvps tutorial in tutorial/tutorial_2d_pvps.ipynb.

For more information on this metric, please refer to our paper: Waymo Open Dataset: Panoramic Video Panoptic Segmentation. For more information about the original STQ metric, refer to: STEP: Segmenting and Tracking Every Pixel.

Rules regarding awards

Please see the Waymo Open Dataset Challenges 2023 Official Rules here.