Novel View Synthesis with Diffusion Models

3D generation from a single image

We present 3DiM, a diffusion model for 3D novel view synthesis, which is able to translate a single input view into consistent and sharp completions across many views. The core component of 3DiM is a pose-conditional image-to-image diffusion model, which is trained to take a source view and its pose as inputs and to generate a novel view for a target pose as output. 3DiM can then generate multiple views that are approximately 3D consistent using a novel technique called stochastic conditioning. At inference time, the output views are generated autoregressively: when generating each novel view, we select a random conditioning view from the set of previously generated views at each denoising step. We demonstrate that stochastic conditioning significantly improves 3D consistency compared to a naive sampler for an image-to-image diffusion model, which conditions on a single fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated completions from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, which quantifies the 3D consistency of a generated object by training a neural field on the model's output views. 3DiM is geometry-free, does not rely on hyper-networks or test-time optimization for novel view synthesis, and allows a single model to easily scale to a large number of scenes.


Authored by Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi and Mohammad Norouzi from Google Research.


Examples: ShapeNet renders to 3D

We show samples from a single 3DiM trained on all of ShapeNet. The source views are taken from ShapeNet objects that were not included in the training data.


This 3DiM takes only the relative pose between the source and target views as input, rather than their absolute poses. This lets us test the model on arbitrary images for which no pose is available, since we only need to specify relative changes in pose.
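For concreteness, one way to obtain such a relative pose from two camera-to-world matrices is sketched below; the 4x4 homogeneous-matrix convention is an assumption for illustration, not necessarily the pose encoding consumed by the model.

```python
import numpy as np

def relative_pose(source_pose, target_pose):
    # Pose of the target camera expressed in the source camera's frame,
    # assuming both inputs are 4x4 camera-to-world matrices. Only this
    # relative transform (not the absolute poses) needs to be supplied,
    # which is why unposed in-the-wild images can be used as inputs.
    return np.linalg.inv(source_pose) @ target_pose
```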


To increase robustness, the training view pairs are rendered with objects at random orientations and varying scales. We rendered 128 pairs (256 views) for each training object using kubric, and we apply random hue augmentation during training. The model was trained for 840K steps at batch size 512 and has 471M parameters. At generation time we use 256 denoising steps with classifier-free guidance (guidance weight 6).
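As a rough illustration of how the guidance weight enters sampling, the sketch below combines conditional and unconditional noise predictions using one common classifier-free guidance parameterization (Ho & Salimans). The `model` callable, its arguments, and the use of `None` as the null conditioning are assumptions for illustration, not the actual 3DiM interface.

```python
def guided_epsilon(model, x_t, t, cond_view, cond_pose, target_pose, w=6.0):
    # Classifier-free guidance: combine the conditional and unconditional
    # noise predictions with guidance weight w (w = 0 recovers the purely
    # conditional prediction in this parameterization).
    # NOTE: `model` and its signature are hypothetical placeholders; passing
    # None stands in for whatever null conditioning was used during training.
    eps_cond = model(x_t, t, cond_view=cond_view, cond_pose=cond_pose,
                     target_pose=target_pose)
    eps_uncond = model(x_t, t, cond_view=None, cond_pose=None,
                       target_pose=target_pose)
    return (1.0 + w) * eps_cond - w * eps_uncond
```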

Examples: in-the-wild images to 3D

These samples are from the same 3DiM as the first set of samples. Here, however, the source images were taken directly from the internet; we only ensured that they correspond to ShapeNet classes, have white backgrounds, and cast little to no shadow.

Examples: Imagen to 3D

A mug with '3DiM' written on it, white background, no shadow.
A 20th century stove, white background, no shadow.
A bed made of stone, white background, no shadows, at an angle.
A rusty car at an angle, white background, no shadow.
An upright piano, at an angle, white background, no shadow.
An avocado in the shape of an armchair, white background, no shadow, at an angle, remove shadows.
An antique clock, white background, no shadow.
A modern printer, white background, no shadows.

These samples are from a single 3DiM trained on all of ShapeNet. The source images were produced with Imagen, an AI system that converts text into images.


To make Imagen generate objects on white backgrounds, we inpaint a 5px white border when generating a 64x64 image from text, and then upsample the images to 128x128 using a text-conditional super-resolution model.
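A minimal sketch of the border constraint is shown below: it builds a white canvas and a binary mask marking the 5-pixel border, which an inpainting sampler could then keep fixed to white while the interior is generated. How the mask is actually consumed by Imagen's sampler is left abstract here.

```python
import numpy as np

def white_border_constraint(size=64, border=5):
    # White canvas plus a binary mask marking the outer `border` pixels.
    # Keeping the masked pixels white at every sampling step encourages the
    # text-to-image model to generate the object on a white background.
    image = np.ones((size, size, 3), dtype=np.float32)   # all-white canvas
    mask = np.zeros((size, size, 1), dtype=np.float32)
    mask[:border, :] = 1.0   # top border
    mask[-border:, :] = 1.0  # bottom border
    mask[:, :border] = 1.0   # left border
    mask[:, -border:] = 1.0  # right border
    return image, mask
```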

Comparisons to prior work

[Figure: side-by-side comparison videos. Columns: Input View, SRN, PixelNeRF, VisionNeRF, 3DiM (ours), and Ground Truth (left panel) / Real Data (right panel).]

We compare against prior state-of-the-art methods for novel view synthesis from few images on the SRN ShapeNet benchmark. Unlike 3DiM, the methods whose outputs we could obtain all guarantee 3D consistency due to their use of volume rendering. For each method, we render the same trajectories given the same conditioning image.


Prior methods directly regress their outputs, which often leads to severe blurriness. We show that 3DiM overcomes this problem: it is a generative model by design, and diffusion models have a natural inductive bias towards generating much sharper samples. Below we show more samples from the 3DiMs we trained for the prior-work comparisons: a 471M-parameter 3DiM for cars, and a 1.3B-parameter 3DiM for chairs.

State-of-the-art FID scores on SRN ShapeNet

3DiM Technical Details

Generation with 3DiM -- We propose stochastic conditioning, a new sampling strategy in which views are generated autoregressively with an image-to-image diffusion model. At each denoising step, we condition on a random previously generated (or input) view, so that, given enough denoising steps, the denoising process is guided towards 3D consistency with all previous frames.
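Below is a minimal sketch of this sampler in Python. The denoiser interface (`model(...)`), its argument names, and the pose representation are hypothetical placeholders rather than the actual 3DiM API; the essential point is that the conditioning view is re-drawn uniformly from all previously available views at every denoising step.

```python
import random
import numpy as np

def stochastic_conditioning_sample(model, input_view, input_pose,
                                   target_poses, num_steps=256):
    # Sketch of the stochastic conditioning sampler. `model` is assumed to be
    # a pose-conditional image-to-image denoiser: given a noisy target x_t,
    # the step index t, a clean conditioning view with its pose, and the
    # target pose, it returns the next, less-noisy estimate. The exact
    # signature is hypothetical.
    views, poses = [input_view], [input_pose]

    for target_pose in target_poses:            # generate target views autoregressively
        x = np.random.randn(*input_view.shape)  # start each view from pure noise

        for t in reversed(range(num_steps)):
            # Re-draw the conditioning view uniformly at every denoising step,
            # so that over many steps the sample is pulled toward consistency
            # with all previously generated views, not just one of them.
            k = random.randrange(len(views))
            x = model(x, t, cond_view=views[k], cond_pose=poses[k],
                      target_pose=target_pose)

        views.append(x)
        poses.append(target_pose)

    return views[1:]  # the newly generated views (input view excluded)
```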

3DiM research highlights

X-UNet -- Our proposed changes to the image-to-image UNet, which we show are critical for achieving high-quality results.