SceneCrafter: Controllable Multi-View Driving Scene Editing
Abstract
Simulation is crucial for developing and evaluating
autonomous vehicle (AV) systems. Recent literature builds
on a new generation of generative models to synthesize
highly realistic images for full-stack simulation. However,
purely synthetically generated scenes are not grounded
in reality and struggle to inspire confidence in the
relevance of their outcomes. Editing models, on the
other hand, leverage source scenes from real driving
logs, and enable the simulation of different traffic layouts,
behaviors, and operating conditions such as weather and
time of day. While image editing is an established topic
in computer vision, it presents a fresh set of challenges
in driving simulation: (1) the need for cross-camera
3D consistency, (2) learning “empty street” priors from
driving data with foreground occlusions, and (3) obtaining
paired image tuples under varied editing conditions while
preserving consistent layout and geometry. To address
these challenges, we propose SceneCrafter, a versatile
editor for realistic 3D-consistent manipulation of driving
scenes captured from multiple cameras. We build on
recent advancements in multi-view diffusion models, using
a fully controllable framework that scales seamlessly to
multi-modality conditions like weather, time of day, agent
boxes and high-definition maps. To generate paired data
for supervising the editing model, we propose a novel
framework built on Prompt-to-Prompt that produces
geometrically consistent synthetic pairs with global
edits. We also introduce an alpha-blending framework to
synthesize data with local edits, leveraging a model trained
on empty-street priors through a novel masked-training and
multi-view repaint paradigm. SceneCrafter demonstrates
powerful editing capabilities and achieves state-of-the-art
realism, controllability, 3D consistency, and scene editing
quality compared to existing baselines.
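
As a rough, hedged illustration of the alpha-blending operation mentioned in the abstract (the function name, array shapes, and mask below are hypothetical and not taken from the paper), a locally edited view can be formed as a per-pixel convex combination of an edited foreground layer and an empty-street background; how SceneCrafter actually constructs these layers and masks is described in the full paper.

    # Minimal sketch, assuming NumPy arrays in [0, 1]; names are illustrative only.
    import numpy as np

    def alpha_blend_local_edit(background, edited_object, alpha):
        # Per-pixel convex combination: alpha = 1 keeps the edited object,
        # alpha = 0 keeps the empty-street background.
        return alpha * edited_object + (1.0 - alpha) * background

    # Toy usage with random images standing in for one camera view.
    h, w = 256, 384
    background = np.random.rand(h, w, 3)     # view with foreground agents removed
    edited_object = np.random.rand(h, w, 3)  # inserted or edited agent layer
    alpha = np.zeros((h, w, 1))
    alpha[100:180, 150:260] = 1.0            # hard box mask; a soft mask also works

    composite = alpha_blend_local_edit(background, edited_object, alpha)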