Abstract

We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.

Waymo Block-NeRF Dataset

You can download the Waymo Block-NeRF Dataset to reproduce our Block-NeRF results on the San Francisco Mission Bay dataset and to try your own scene reconstruction techniques for a direct comparison. The dataset consists of 100 seconds of driving recorded by 12 cameras, for a total of approximately 12,000 images. You can access the dataset here.

Each image is provided as an individual tf.train.Example, stored in GZIP-compressed TFRecord containers, with the following fields:

  • image_hash: Unique hash of the image data.
  • cam_idx: Numeric ID of the camera that took the given image.
  • equivalent_exposure: A value that is equivalent to a measure of camera exposure.
  • height: Image height in pixels.
  • width: Image width in pixels.
  • image: PNG-encoded RGB image data of shape [H, W, 3].
  • ray_origins: Camera ray origins in 3D of shape [H, W, 3]. Has pixel-wise correspondence to “image”.
  • ray_dirs: Normalized camera ray directions in 3D of shape [H, W, 3]. Has pixel-wise correspondence to “image”.
  • intrinsics: Camera intrinsic focal lengths (f_u, f_v) of shape [2].
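Because ray_origins and ray_dirs correspond pixel-wise to the image, the ray through a given pixel can be looked up directly, and a 3D point at distance t along it is origin + t * direction. The following is a minimal sketch in plain Python rather than code from the dataset release. It assumes the [H, W, 3] arrays arrive flattened in row-major order (the layout is not stated above), and the helper names pixel_vec3 and point_on_ray are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_vec3(flat: Sequence[float], width: int, row: int, col: int) -> Vec3:
    """Return the 3-vector stored for pixel (row, col) in a flattened [H, W, 3] array.

    Row-major order is an assumption made for this sketch; the dataset
    description does not state the memory layout.
    """
    start = (row * width + col) * 3
    x, y, z = flat[start:start + 3]
    return (x, y, z)


def point_on_ray(origin: Vec3, direction: Vec3, t: float) -> Vec3:
    """Return origin + t * direction.

    t is a distance in scene units only if direction has unit length,
    which the ray_dirs field is described as providing.
    """
    return tuple(o + t * d for o, d in zip(origin, direction))


# Toy 1x2 "image": two pixels that share an origin but look in different directions.
ray_origins = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
ray_dirs = [1.0, 0.0, 0.0, 0.0, 0.6, 0.8]

origin = pixel_vec3(ray_origins, width=2, row=0, col=1)
direction = pixel_vec3(ray_dirs, width=2, row=0, col=1)
sample = point_on_ray(origin, direction, t=5.0)  # 5 units along the second pixel's ray
```

With real data the same indexing would more naturally be done on arrays reshaped to [H, W, 3], for example with NumPy; the pure-Python version is only meant to make the pixel-to-ray correspondence explicit.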

Additionally, for the validation set only, we provide a mask field of shape [H, W, 1] containing approximate semantic segmentation masks that indicate the presence of movable objects. This field is populated using an off-the-shelf 2D semantic segmentation method, so the predictions provide only rough guidance for masking out movable objects.
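One possible use of the mask is to exclude movable-object pixels when scoring a rendered validation image. The sketch below computes PSNR over the remaining pixels only. It is an illustration, not necessarily the evaluation protocol used for the Block-NeRF results. It assumes intensities scaled to [0, 1], flattened row-major arrays, and that a nonzero mask value marks a movable object; none of these conventions are stated above, and masked_psnr is a hypothetical helper name.

```python
import math
from typing import Sequence


def masked_psnr(pred: Sequence[float], target: Sequence[float],
                movable: Sequence[int], channels: int = 3) -> float:
    """PSNR in dB over pixels not flagged as movable, for intensities in [0, 1].

    pred and target are flattened [H, W, channels] images; movable is a
    flattened [H, W, 1] mask. Treating nonzero as "movable object" is an
    assumption of this sketch.
    """
    sq_err, count = 0.0, 0
    for pixel, is_movable in enumerate(movable):
        if is_movable:
            continue  # skip pixels flagged as movable objects
        for c in range(channels):
            i = pixel * channels + c
            sq_err += (pred[i] - target[i]) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask excludes every pixel")
    mse = sq_err / count
    return math.inf if mse == 0.0 else -10.0 * math.log10(mse)


# Toy 1x2 image: the second pixel is flagged as movable and is ignored,
# so its large error does not affect the score.
pred = [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
target = [0.4, 0.4, 0.4, 0.0, 0.0, 0.0]
mask = [0, 1]
score = masked_psnr(pred, target, mask)
```

Since the masks are only approximate, a score computed this way still includes some error from movable objects the segmentation missed and drops some static pixels it flagged by mistake.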

Reconstructions of San Francisco

Appearance modulation

Supplementary results

Links