SceneCrafter: Controllable Multi-View Driving Scene Editing

Authors

  • Zehao Zhu

  • Yuliang Zou

  • Chiyu Max Jiang

  • Bo Sun

  • Vincent Casser

  • Xiukun Huang

  • Jiahao Wang

    Johns Hopkins University

  • Zhenpei Yang

  • Ruiqi Gao

    Google DeepMind

  • Leonidas Guibas

    Google DeepMind

  • Mingxing Tan

  • Dragomir Anguelov

Abstract

Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetic scenes are not grounded in reality and struggle to inspire confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents a fresh set of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning “empty street” priors from driving data with foreground occlusions, and (3) obtaining paired image tuples with varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic, 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advances in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modal conditions such as weather, time of day, agent boxes, and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt that produces geometrically consistent synthetic pairs with global edits. We also introduce an alpha-blending framework that synthesizes data with local edits, leveraging a model trained on empty-street priors through a novel masked-training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene-editing quality compared to existing baselines.
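
The paired-data idea for global edits can be pictured as two sampling runs that share their initial noise, so the two outputs land on roughly the same layout while differing in the global condition; Prompt-to-Prompt additionally shares attention maps between the runs, which this toy sketch omits. Below is a minimal PyTorch illustration, assuming a deterministic sampler callable `sample` (e.g. DDIM) is available; all names here are hypothetical, not the paper's API.

```python
import torch

def paired_global_edit(sample, cond_a, cond_b, shape, seed=0):
    """Generate a roughly geometry-aligned image pair from one noise seed."""
    g = torch.Generator().manual_seed(seed)
    z = torch.randn(shape, generator=g)  # shared initial noise for both runs
    img_a = sample(z, cond_a)            # e.g. "sunny daytime" conditioning
    img_b = sample(z, cond_b)            # e.g. "rainy night" conditioning
    return img_a, img_b                  # paired sample with a global edit
```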
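The “masked training” for empty-street priors can be read as supervising a denoiser only on background pixels, so occluding foreground agents never contribute gradient and the model learns what the unoccluded street looks like. A minimal sketch of such a loss, assuming a standard DDPM linear noise schedule and a user-supplied `denoiser`; none of these names come from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative alpha-bar

def masked_denoising_loss(denoiser, x0, agent_mask, t):
    """x0: clean images (B,C,H,W); agent_mask: (B,1,H,W), 1 on foreground agents;
    t: integer timesteps (B,). Returns noise-prediction MSE over background only."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward process
    pred_noise = denoiser(x_t, t)
    keep = (1.0 - agent_mask).expand_as(x0)          # background-only weights
    return ((pred_noise - noise) ** 2 * keep).sum() / keep.sum().clamp(min=1.0)
```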
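For local edits, the alpha-blending framework composites repainted content with the source frame through a soft matte, so the resulting pair differs only inside the edited region. The sketch below is a rough single-view illustration in NumPy/SciPy (the paper operates across multiple views with a repaint paradigm); the Gaussian feathering and all names are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def alpha_blend_local_edit(source, empty_street, agent_mask, feather_sigma=3.0):
    """source, empty_street: float arrays in [0, 1], shape (H, W, 3).
    agent_mask: binary (H, W), 1 where an agent should be removed."""
    alpha = gaussian_filter(agent_mask.astype(np.float32), feather_sigma)
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]      # soft (H, W, 1) matte
    # Inside the matte take the repainted empty-street content; outside keep
    # the source, so the rest of the scene stays pixel-identical.
    return alpha * empty_street + (1.0 - alpha) * source
```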