Novel View Synthesis with Diffusion Models

3D generation from a single image

We present 3DiM (pronounced "three-dim"), a diffusion model for 3D novel view synthesis from as few as a single image. The core of 3DiM is an image-to-image diffusion model: 3DiM takes a single reference view and a relative pose as input, and generates a novel view via diffusion. 3DiM can then generate a full, 3D-consistent scene using our novel stochastic conditioning sampler. The output frames of the scene are generated autoregressively, and during the reverse diffusion process of each individual frame we select a random conditioning frame from the set of previous frames at each denoising step. We demonstrate that stochastic conditioning yields much more 3D-consistent results than the naïve sampling process, which conditions on only a single previous frame. We compare 3DiMs to prior work on the SRN ShapeNet dataset, demonstrating that videos generated by 3DiM from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, which measures the 3D consistency of a generated object by training a neural field on the model's output views. 3DiMs are geometry-free, do not rely on hyper-networks or test-time optimization for novel view synthesis, and allow a single model to easily scale to a large number of scenes.
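To make the 3D consistency scoring protocol concrete, here is a minimal Python/NumPy sketch. The callables `fit_neural_field` and `render_view` are hypothetical stand-ins for a NeRF-style training and rendering pipeline, and PSNR is used as an illustrative held-out metric; only the split-and-score logic follows the description above, not the paper's exact evaluation code.

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((np.asarray(img_a) - np.asarray(img_b)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def consistency_score(views, poses, fit_neural_field, render_view, n_train):
    """3D consistency scoring sketch: fit a neural field to a subset of the
    model's generated views, then check how well it explains the held-out ones.

    `fit_neural_field(views, poses)` and `render_view(field, pose)` are
    hypothetical callables standing in for a NeRF-style pipeline; only the
    split-and-score protocol is shown here.
    """
    field = fit_neural_field(views[:n_train], poses[:n_train])
    held_out = zip(views[n_train:], poses[n_train:])
    return float(np.mean([psnr(render_view(field, pose), view)
                          for view, pose in held_out]))
```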


Authored by Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi from Google Research.


3DiM is a generative AI system that creates novel views of a 3D scene from a single input image.

Generation with 3DiM -- We propose stochastic conditioning, a new sampling strategy in which we generate views autoregressively with an image-to-image diffusion model. At each denoising step we condition on a random previous view, so that, given enough denoising steps, the denoising process is guided to be 3D consistent with all previous frames.
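As a rough illustration of this sampling loop, here is a minimal Python/NumPy sketch of stochastic conditioning. The `denoise_step` callable is a hypothetical wrapper around one reverse-diffusion update of the pose-conditional image-to-image model, and the default of 256 steps mirrors the setting used for the videos below; the point of the sketch is only the random re-selection of the conditioning frame at every step.

```python
import numpy as np

def sample_scene(denoise_step, input_view, input_pose, target_poses,
                 num_steps=256, rng=np.random.default_rng(0)):
    """Stochastic conditioning sketch: generate novel views autoregressively,
    re-drawing the conditioning frame at every denoising step.

    `denoise_step(x_t, t, cond_view, cond_pose, target_pose)` is a hypothetical
    callable wrapping one reverse-diffusion update of the image-to-image model.
    """
    views, poses = [np.asarray(input_view)], [input_pose]
    for target_pose in target_poses:
        # Start each new frame from Gaussian noise.
        x_t = rng.standard_normal(views[0].shape)
        for t in reversed(range(num_steps)):
            # Condition this denoising step on a random previously
            # generated view (or the original input view).
            k = rng.integers(len(views))
            x_t = denoise_step(x_t, t, views[k], poses[k], target_pose)
        views.append(x_t)
        poses.append(target_pose)
    return views[1:]  # the generated novel views
```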

Results on diverse data


We show selected samples from a single 3DiM trained on all of ShapeNet. We rendered 250 views for each asset with Kubric and trained a 471M-parameter 3DiM. Videos are sampled from a single input image using 256 denoising steps.

Pose Conditioning × Image-to-Image Diffusion


By allowing the core of 3DiM to remain an image-to-image model, we can bypass the difficulties of designing and training architectures that jointly model multiple frames. More importantly, we enable training with datasets that have as few as two views per scene.
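To illustrate why two views per scene suffice, the sketch below draws a single training example as a pair of views of the same scene plus their relative camera pose. The helper itself and the 4x4 camera-to-world convention are illustrative assumptions for this sketch, not the paper's exact data pipeline.

```python
import numpy as np

def sample_training_pair(scene_views, scene_poses, rng=np.random.default_rng(0)):
    """Draw one image-to-image training example from a scene.

    `scene_views` is a list of images and `scene_poses` the matching 4x4
    camera-to-world matrices; any scene with at least two views can be used.
    """
    i, j = rng.choice(len(scene_views), size=2, replace=False)
    # Relative pose mapping the conditioning camera frame to the target's.
    relative_pose = np.linalg.inv(scene_poses[i]) @ scene_poses[j]
    return scene_views[i], scene_views[j], relative_pose
```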

3DiM research highlights

X-UNet -- Our proposed changes to the image-to-image UNet, which we show are critical for achieving high-quality results.
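This page does not spell out those changes; in the paper, X-UNet processes the clean conditioning frame and the noisy target frame with weight-shared streams that exchange information through cross-attention. Below is a minimal, framework-agnostic Python/NumPy sketch of such a cross-attention exchange between the two frames' features; the flattened shapes and projection matrices are illustrative assumptions, not the actual X-UNet layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(target_feats, cond_feats, w_q, w_k, w_v):
    """One cross-attention exchange: features of the noisy target frame
    attend to features of the clean conditioning frame.

    target_feats: (n, d) flattened target-frame features.
    cond_feats:   (m, d) flattened conditioning-frame features.
    w_q, w_k, w_v: (d, d) illustrative projection matrices.
    """
    q = target_feats @ w_q
    k = cond_feats @ w_k
    v = cond_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection, as is typical for attention blocks.
    return target_feats + attn @ v
```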

Comparisons to Prior Work


We compare against prior state-of-the-art methods for novel view synthesis from few images on the SRN ShapeNet benchmark. Unlike 3DiM, the methods whose outputs we could acquire all guarantee 3D consistency due to their use of volume rendering. We render the same trajectories given the same conditioning image.

[Comparison videos: Input View, SRN, PixelNeRF, VisionNeRF, 3DiM (ours), Ground Truth; Real Data.]

State-of-the-art FID scores on SRN ShapeNet


Prior methods directly regress their outputs, which often leads to severe blurriness. We show that 3DiM overcomes this problem: it is a generative model by design, and diffusion models have a natural inductive bias towards producing much sharper samples. Below we show more samples from the 3DiMs we trained for the prior-work comparisons: a 471M-parameter 3DiM for cars and a 1.3B-parameter 3DiM for chairs.
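For reference, FID compares Gaussian fits of feature statistics (typically pooled Inception activations) computed from generated and real views. Below is a minimal Python/NumPy/SciPy sketch of the Fréchet distance between two feature sets; the feature-extraction step is assumed to have happened already, and this is not the exact evaluation code behind the reported scores.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_fake, feats_real):
    """Frechet distance between Gaussian fits of two feature sets (rows = samples)."""
    mu_f, mu_r = feats_fake.mean(axis=0), feats_real.mean(axis=0)
    cov_f = np.cov(feats_fake, rowvar=False)
    cov_r = np.cov(feats_real, rowvar=False)
    # Matrix square root of the covariance product; drop numerical imaginary parts.
    cov_sqrt = sqrtm(cov_f @ cov_r)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real
    return float(np.sum((mu_f - mu_r) ** 2)
                 + np.trace(cov_f + cov_r - 2.0 * cov_sqrt))
```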

Special Thanks


We would like to thank Ben Poole for thoroughly reviewing this work and providing useful feedback and ideas since the earliest stages of our research. We thank Tim Salimans for providing us with stable code to train diffusion models, which we used as the starting point for this paper, as well as code for neural network modules and diffusion sampling tricks used in their more recent "Video Diffusion Models" paper. We thank Erica Moreira for her critical support in juggling resource allocations so we could execute our work. We also thank David Fleet for his key support in securing the computational resources required for our work, as well as the many helpful research discussions throughout. We additionally would like to acknowledge and thank Kai-En Lin and Vincent Sitzmann for providing us with the outputs of their work on novel view synthesis and for their helpful correspondence. We thank Mehdi Sajjadi and Etienne Pot for consistently lending us their expertise, especially on issues with datasets, cameras, rays, and all things 3D. We thank Keunhong Park, who refactored much of the NeRF code we used, making it easier to implement our proposed 3D consistency evaluation scheme. We thank Sarah Laszlo for helping us ensure our models and datasets meet responsible AI practices. Finally, we'd like to thank Geoffrey Hinton, Chitwan Saharia, and more widely the Google Brain Toronto team for their useful feedback, suggestions, and ideas throughout our research effort.