SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image

ICCV 2023


Xiaoyu Zhou1, Zhiwei Lin1, Xiaojun Shan1, Yongtao Wang1, Deqing Sun2, Ming-Hsuan Yang3

1Wangxuan Institute of Computer Technology, Peking University, 2Google Research, 3University of California, Merced

Novel view synthesis result comparisons. Given a single image captured in an outdoor scene, our method synthesizes novel views with fewer visual artifacts, geometric deformities, and blurs. Notably, our method faithfully models intricate details, such as tiny objects, symbols, and traffic signs, resulting in more photo-realistic views.

Abstract

Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input. In this paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image, based on improved multiplane images (MPI). Observing that the depth distribution varies significantly across unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI that arranges the planes in accordance with each scene image. To represent intricate geometry and multi-scale details, we further introduce a hierarchical refinement branch, which yields high-quality synthesized novel views. Our method demonstrates considerable performance gains when synthesizing large-scale unbounded outdoor scenes from a single image on the KITTI dataset and generalizes well to the unseen Tanks and Temples dataset. The code and models will be made public.
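To make the two key ideas concrete, below is a minimal, hypothetical PyTorch sketch of (i) adaptive-bins placement of MPI planes in disparity space and (ii) standard front-to-back MPI over-compositing. The function and argument names (adaptive_mpi_depths, bin_logits, composite_mpi) are illustrative assumptions, not the paper's actual implementation; the bin strategy follows an AdaBins-style softmax over per-scene logits.

```python
import torch
import torch.nn.functional as F

def adaptive_mpi_depths(bin_logits, d_min=1.0, d_max=80.0):
    """Place MPI planes adaptively per scene (hypothetical sketch).

    bin_logits: (B, N) per-image scores for N depth bins, e.g. predicted
    by a scene-level encoder head (an AdaBins-style strategy).
    Returns plane depths of shape (B, N), sorted near-to-far.
    """
    # Normalized bin widths sum to 1 over the disparity range.
    widths = F.softmax(bin_logits, dim=-1)                      # (B, N)
    # Cumulative widths give bin edges in [0, 1]; bin centers sit halfway.
    edges = torch.cumsum(widths, dim=-1)
    centers = edges - 0.5 * widths                              # (B, N)
    # Map centers to depths uniformly in disparity (1/depth) space, so
    # plane resolution is allocated sensibly for unbounded scenes.
    disp = 1.0 / d_min - centers * (1.0 / d_min - 1.0 / d_max)
    return 1.0 / disp

def composite_mpi(colors, alphas):
    """Standard MPI over-compositing, front-to-back.

    colors: (B, N, 3, H, W), alphas: (B, N, 1, H, W), plane 0 nearest.
    """
    # Transmittance before plane i: product of (1 - alpha) of planes in front.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return (trans * alphas * colors).sum(dim=1)                 # (B, 3, H, W)
```

Allocating bins in disparity rather than depth concentrates planes near the camera, which matters for unbounded scenes where distant geometry compresses toward zero disparity; the per-scene softmax lets each image shift that allocation toward its own depth distribution.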


Qualitative comparison of novel view synthesis on the KITTI dataset.

Visualization results show that our method generates finer details than other single-view NVS methods.


Qualitative comparison of disparity map and novel view synthesis on the KITTI dataset.

(a) Disparity maps from previous work exhibit structural biases and missing objects, leading to unpleasant artifacts and distortions in the output. (b) The comparative disparity maps show that our method better recovers the spatial structure of complex scenes and intricate object boundaries. (c) Our method consistently delivers higher-quality, nearly flawless disparity maps and outputs, even in challenging regions.


Qualitative results of our method generalizing to the unseen Tanks and Temples (T&T) dataset.

The symbol * denotes that the model is trained on KITTI and evaluated on T&T.

Qualitative results on outdoor scenes from the KITTI dataset.

Each compared group consists of two synthesized views of outdoor scenes in the KITTI dataset, with the novel views synthesized by MINE (top row) and the images generated by our method (bottom row) at the same viewpoint. We highlight the challenging areas and hard cases in these outdoor scenes.


Qualitative results on indoor scenes from the T&T dataset.

Each compared group consists of two synthesized views of indoor scenes in the T&T dataset, with the novel views synthesized by MINE (top row) and the images generated by our method (bottom row) at the same viewpoint. Notably, both methods are trained on the KITTI dataset and are not fine-tuned on indoor data.