Novel view synthesis result comparisons. Given a single image captured in an outdoor scene, our method synthesizes novel views with fewer visual artifacts, geometric deformities, and blurs. Notably, our method faithfully captures intricate details, such as tiny objects, symbols, and traffic signs, resulting in more photo-realistic views.
Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input. In this paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image based on improved multiplane images (MPI). Observing that depth distribution varies significantly for unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI to arrange planes in accordance with each scene image. To represent intricate geometry and multi-scale details, we further introduce a hierarchical refinement branch, which results in high-quality synthesized novel views. Our method demonstrates considerable performance gains in synthesizing large-scale unbounded outdoor scenes using a single image on the KITTI dataset and generalizes well to the unseen Tanks and Temples dataset. The code and models will be made public.
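The adaptive-bins strategy mentioned above arranges MPI planes according to each scene's depth distribution rather than at fixed positions. The following is a minimal sketch of one way such scene-adaptive plane placement could work, in the spirit of AdaBins-style depth binning; the function name, the softmax-over-logits normalization, and the depth range are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def adaptive_mpi_planes(width_logits, d_min=1.0, d_max=80.0):
    """Sketch of scene-adaptive MPI plane placement (assumed design,
    not the paper's exact method): a network predicts one logit per
    bin, logits are normalized into relative bin widths, and plane
    depths are taken as the resulting bin centers. Planes end up
    denser where the predicted widths are small.

    width_logits: (N,) per-scene logits, one per MPI plane.
    Returns: (N,) monotonically increasing plane depths in [d_min, d_max].
    """
    # Softmax over logits -> relative bin widths that sum to 1.
    w = np.exp(width_logits - width_logits.max())
    w = w / w.sum()
    # Cumulative widths give bin edges across the depth range.
    edges = d_min + (d_max - d_min) * np.cumsum(np.concatenate([[0.0], w]))
    # Use bin centers as the MPI plane depths.
    return 0.5 * (edges[:-1] + edges[1:])
```

With uniform logits this reduces to evenly spaced planes; a scene-dependent predictor can instead concentrate planes at the depths where most scene content lies, which is the motivation for per-scene adaptation in unbounded outdoor settings.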
Visualization results show that our method recovers finer details than other single-view NVS methods.
(a) Disparity maps in previous work exhibit structural biases and missing objects, leading to unpleasant artifacts and distortions in the output. (b) The comparative disparity maps show that our method better recovers the spatial structure of complex scenes and intricate object boundaries. (c) Our method consistently delivers higher-quality, flawless disparity maps and outputs, even in challenging regions.
The symbol * denotes that the model is trained on KITTI and evaluated on Tanks and Temples (T&T).
Each comparison group consists of two synthesized views of outdoor scenes in the KITTI dataset, with the novel view synthesized by MINE (top row) and the image generated by our method (bottom row) at the same viewpoint. We highlight the challenging areas and hard cases in these outdoor scenes.
Each comparison group consists of two synthesized views of indoor scenes in the T&T dataset, with the novel view synthesized by MINE (top row) and the image generated by our method (bottom row) at the same viewpoint. Notably, both methods are trained on the KITTI dataset and are not fine-tuned on the indoor dataset.