Project 5: Exploring Diffusion Models

Student Name: Kelvin Huang

Part A: The Power of Diffusion Models!

Part 0: Setup

We use the DeepFloyd IF diffusion model, a two-stage text-to-image model by Stability AI. The first stage generates small images, and the second refines them to higher resolutions.

My random seed is 180 throughout the project.

Inference Steps: 20

Size: 64x64

Man Hat 64x64
A man wearing a hat
Oil Painting 64x64
An oil painting of a snowy mountain village
Rocket Ship 64x64
A rocket ship

Size: 256x256

Man Hat 256x256
A man wearing a hat
Oil Painting 256x256
An oil painting of a snowy mountain village
Rocket Ship 256x256
A rocket ship

Inference Steps: 80

Size: 64x64

Man Hat 64x64 80 Steps
A man wearing a hat
Oil Painting 64x64 80 Steps
An oil painting of a snowy mountain village
Rocket Ship 64x64 80 Steps
A rocket ship

Size: 256x256

Man Hat 256x256 80 Steps
A man wearing a hat
Oil Painting 256x256 80 Steps
An oil painting of a snowy mountain village
Rocket Ship 256x256 80 Steps
A rocket ship

Reflection:

At a lower resolution (64x64 from Stage 1), the outputs capture the basic structure and color scheme corresponding to the prompts but lack fine details, resulting in somewhat abstract representations. In contrast, the higher resolution (256x256 from Stage 2) images refine these initial representations, adding significant detail and texture, resulting in visually coherent outputs. Varying num_inference_steps (e.g., from 20 to 80) reveals a trade-off: fewer steps produce faster but less refined results, while more steps improve the quality at the expense of computation time.

Part 1: Sampling Loops

Diffusion models generate images by reversing a noise-adding process. Starting with a clean image \( x_0 \), noise is iteratively added at each timestep \( t \), creating progressively noisier images \( x_t \) until reaching pure noise at \( t = T \). The goal of the diffusion model is to predict and remove this noise step-by-step, enabling the reconstruction of \( x_0 \) or partially denoised versions like \( x_{t-1} \).

The generation process begins with a pure Gaussian noise sample \( x_T \) at \( T = 1000 \) (for DeepFloyd). Using the noise-schedule coefficients \( \bar{\alpha}_t \), the model estimates the noise in \( x_t \), which is then partially removed to obtain a cleaner image at the previous timestep. This iterative sampling continues until a clean image \( x_0 \) is reconstructed. The coefficients \( \bar{\alpha}_t \) and the sequence of denoising steps are fixed ahead of time by the noise schedule used during training.

Part 1.1: Implementing the Forward Process

A key component of diffusion models is the forward process, which takes a clean image \( x_0 \) and progressively adds noise to it, resulting in noisy versions \( x_t \) at each timestep \( t \). This process is defined mathematically as:

\( q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) \mathbf{I}) \)

This is equivalent to computing:

\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \),

where \( \epsilon \sim \mathcal{N}(0, \mathbf{I}) \). Here, \( x_t \) is sampled from a Gaussian distribution with mean \( \sqrt{\bar{\alpha}_t} x_0 \) and variance \( (1 - \bar{\alpha}_t) \mathbf{I} \). Note that the forward process both adds noise and scales the original image \( x_0 \) by \( \sqrt{\bar{\alpha}_t} \).
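As a reference, here is a minimal sketch of the forward process, assuming alphas_cumprod is the tensor of \( \bar{\alpha}_t \) values exposed by DeepFloyd's scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im (x_0) to produce x_t at timestep t."""
    alpha_bar = alphas_cumprod[t]
    epsilon = torch.randn_like(im)                                   # epsilon ~ N(0, I)
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * epsilon   # scale image, add noise
    return x_t, epsilon
```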

Campanile Original Picture
Campanile original picture
Campanile at Noise = 250
Campanile at noise = 250
Campanile at Noise = 500
Campanile at noise = 500
Campanile at Noise = 750
Campanile at noise = 750

Part 1.2: Classical Denoising

We try Gaussian blur filtering to remove the noise added above.

Campanile at Noise = 250
Noisy Campanile at \( t = 250 \)
Campanile at Noise = 500
Noisy Campanile at \( t = 500 \)
Campanile at Noise = 750
Noisy Campanile at \( t = 750 \)
Gaussian Blur at Noise = 250
Gaussian Blur at \( t = 250 \)
Gaussian Blur at Noise = 500
Gaussian Blur at \( t = 500 \)
Gaussian Blur at Noise = 750
Gaussian Blur at \( t = 750 \)

In applying Gaussian blur, we used a kernel size of \( 7 \) and a standard deviation (\( \sigma \)) of \( 1.3 \). While this method reduced some noise effectively, it also blurred important details, showing the limitations of classical denoising techniques.
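For reference, this classical baseline is a one-liner with torchvision (a sketch; x_t is assumed to be the noisy image tensor from Part 1.1):

```python
from torchvision.transforms.functional import gaussian_blur

# Classical denoising baseline: blur the noisy image with a 7x7 Gaussian kernel (sigma = 1.3).
blur_denoised = gaussian_blur(x_t, kernel_size=7, sigma=1.3)
```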

Part 1.3: One-Step Denoising

We use a pretrained diffusion model to denoise the images. The denoiser, available at stage_1.unet, is a U-Net trained on a large dataset of \((x_0, x_t)\) image pairs. It predicts the Gaussian noise in a noisy image, which can then be removed to reconstruct (or approximate) the original clean image \(x_0\). This U-Net is conditioned on the timestep \(t\).
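A minimal sketch of one-step denoising, assuming a hypothetical helper estimate_noise(x_t, t) that wraps the stage_1.unet call (with the appropriate prompt embedding) and returns the predicted noise:

```python
def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
    """Estimate the clean image x_0 from the noisy image x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = estimate_noise(x_t, t)       # U-Net noise prediction (assumed helper)
    # Invert x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps to solve for x_0.
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```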

Noisy Campanile at t=250
Noisy Campanile at \( t = 250 \)
Noisy Campanile at t=500
Noisy Campanile at \( t = 500 \)
Noisy Campanile at t=750
Noisy Campanile at \( t = 750 \)
Denoised Campanile at t=250
Campanile at \( t = 250 \) with One-Step Denoiser
Denoised Campanile at t=500
Campanile at \( t = 500 \) with One-Step Denoiser
Denoised Campanile at t=750
Campanile at \( t = 750 \) with One-Step Denoiser

Part 1.4: Iterative Denoising

To efficiently denoise images iteratively, we can create a list of timesteps, called strided_timesteps, which skips steps in the denoising process. This list starts with the noisiest image (highest \( t \)) and ends with the clean image (lowest \( t \)), such that strided_timesteps[-1] corresponds to a clean image. A simple approach is to use a regular stride step (e.g., a stride of 30 works well).

On the \( i \)-th denoising step, we are at \( t = \text{strided_timesteps}[i] \), and want to get to \( t' = \text{strided_timesteps}[i+1] \) (a less noisy image). The denoising step is computed using the formula:

\( x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + \nu_{\sigma} \)

Where:

\( x_t \) is the image at timestep \( t \) and \( x_{t'} \) is the less noisy image at timestep \( t' \); \( \bar{\alpha}_t \) comes from alphas_cumprod; \( \alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'} \) and \( \beta_t = 1 - \alpha_t \); and \( x_0 \) is our current estimate of the clean image, computed with the one-step denoising formula from Part 1.3. The added noise term \( \nu_{\sigma} \) is predicted by the model (e.g., DeepFloyd), and the exact process to compute and add it is abstracted into the add_variance function. This iterative approach progressively refines the image, transitioning from noise to a clean approximation.
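A sketch of a single step of this update, assuming the same alphas_cumprod tensor, the estimate_noise wrapper from Part 1.3, and an add_variance helper for the \( \nu_{\sigma} \) term:

```python
def denoise_step(x_t, t, t_prime, alphas_cumprod, estimate_noise, add_variance):
    """Move from the noisy image at timestep t to a cleaner image at t' < t."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp          # alpha_t = alpha_bar_t / alpha_bar_t'
    beta_t = 1 - alpha_t

    # Current clean-image estimate x_0 via one-step denoising.
    eps_hat = estimate_noise(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

    # Blend the clean estimate with the current noisy image, then add variance.
    x_tp = (alpha_bar_tp.sqrt() * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t
    return add_variance(x_tp, t)                  # adds the predicted nu_sigma term (assumed helper)
```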

Iteratively Denoised Campanile
Iteratively Denoised Campanile
Noisy Campanile at t=90
Noisy Campanile at \( t = 90 \)
Noisy Campanile at t=240
Noisy Campanile at \( t = 240 \)
Noisy Campanile at t=390
Noisy Campanile at \( t = 390 \)
Noisy Campanile at t=540
Noisy Campanile at \( t = 540 \)
Noisy Campanile at t=690
Noisy Campanile at \( t = 690 \)
Original Campanile
Original Campanile
One-Step Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile
Gaussian Blurred Campanile

Part 1.5: Diffusion Model Sampling

By setting i_start = 0 and passing in random noise, we use the diffusion model to generate images from scratch. The process iteratively refines the noisy input into coherent images, as demonstrated below.
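A sketch of sampling from scratch, assuming the iterative_denoise(x, i_start) loop implemented in Part 1.4:

```python
import torch

# Start from pure Gaussian noise at the Stage 1 resolution and run the full
# denoising loop (i_start = 0), so the model generates an image from scratch.
x_T = torch.randn(1, 3, 64, 64, device="cuda")
generated = iterative_denoise(x_T, i_start=0)
```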

Generated Image 1
Generated Image 1
Generated Image 2
Generated Image 2
Generated Image 3
Generated Image 3
Generated Image 4
Generated Image 4
Generated Image 5
Generated Image 5

Part 1.6: Classifier-Free Guidance (CFG)

We noticed that the generated images in the prior section are not very good, and some are completely nonsensical. To improve the quality of the generated images, we use a technique called Classifier-Free Guidance (CFG).

In CFG, we compute both a conditional and an unconditional noise estimate. We denote these as \( \epsilon_c \) and \( \epsilon_u \). Then, we let our new noise estimate be:

\( \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \)

Where \( \gamma \) controls the strength of CFG. Notice that for \( \gamma = 0 \), we get an unconditional noise estimate, and for \( \gamma = 1 \), we get the conditional noise estimate. The magic happens when \( \gamma > 1 \). In this case, we get much higher-quality images.
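A minimal sketch of the CFG combination, assuming estimate_noise(x_t, t, embeds) wraps the U-Net and cond_embeds / uncond_embeds are the conditional and null prompt embeddings:

```python
def cfg_noise(x_t, t, estimate_noise, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: push the unconditional estimate toward the conditional one."""
    eps_c = estimate_noise(x_t, t, cond_embeds)     # conditional noise estimate
    eps_u = estimate_noise(x_t, t, uncond_embeds)   # unconditional noise estimate
    return eps_u + gamma * (eps_c - eps_u)
```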

Some images at \( \gamma = 7 \):

Generated Image 1 with CFG
Generated Image 1
Generated Image 2 with CFG
Generated Image 2
Generated Image 3 with CFG
Generated Image 3
Generated Image 4 with CFG
Generated Image 4
Generated Image 5 with CFG
Generated Image 5

Part 1.7: Image-to-Image Translation

In this task, we take the original test image, add a little noise, and force it back onto the image manifold without any conditioning. This process generates images that are similar to the test image but reflect slight variations based on the added noise.
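A sketch of this SDEdit-style procedure, assuming the forward noising function from Part 1.1, the iterative_denoise loop from Part 1.4, and the strided_timesteps list defined there:

```python
def edit_image(x_orig, i_start, strided_timesteps, alphas_cumprod):
    """Noise the original image up to strided_timesteps[i_start], then denoise it back."""
    t_start = strided_timesteps[i_start]
    x_noisy, _ = forward(x_orig, t_start, alphas_cumprod)   # forward() from Part 1.1 (assumed in scope)
    return iterative_denoise(x_noisy, i_start=i_start)      # denoising loop from Part 1.4 (assumed in scope)
```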

Test Image: Campanile

Image with i_start=1
Image with \( i_{\text{start}} = 1 \)
Image with i_start=3
Image with \( i_{\text{start}} = 3 \)
Image with i_start=5
Image with \( i_{\text{start}} = 5 \)
Image with i_start=7
Image with \( i_{\text{start}} = 7 \)
Image with i_start=10
Image with \( i_{\text{start}} = 10 \)
Image with i_start=20
Image with \( i_{\text{start}} = 20 \)
Original Test Image
Original Test Image

My choice 1: Butterfly and Flower

Set 1 Image with i_start=1
Set 1: Image with \( i_{\text{start}} = 1 \)
Set 1 Image with i_start=3
Set 1: Image with \( i_{\text{start}} = 3 \)
Set 1 Image with i_start=5
Set 1: Image with \( i_{\text{start}} = 5 \)
Set 1 Image with i_start=7
Set 1: Image with \( i_{\text{start}} = 7 \)
Set 1 Image with i_start=10
Set 1: Image with \( i_{\text{start}} = 10 \)
Set 1 Image with i_start=20
Set 1: Image with \( i_{\text{start}} = 20 \)
Set 1 Original Image
Set 1: Original Test Image

My choice 2: Cat

Set 2 Image with i_start=1
Set 2: Image with \( i_{\text{start}} = 1 \)
Set 2 Image with i_start=3
Set 2: Image with \( i_{\text{start}} = 3 \)
Set 2 Image with i_start=5
Set 2: Image with \( i_{\text{start}} = 5 \)
Set 2 Image with i_start=7
Set 2: Image with \( i_{\text{start}} = 7 \)
Set 2 Image with i_start=10
Set 2: Image with \( i_{\text{start}} = 10 \)
Set 2 Image with i_start=20
Set 2: Image with \( i_{\text{start}} = 20 \)
Set 2 Original Image
Set 2: Original Test Image

Part 1.7.1: Editing Hand-Drawn and Web Images

In this task, we project nonrealistic images (e.g., paintings, sketches, or scribbles) onto the natural image manifold using the diffusion model. This demonstrates how the model transforms abstract or synthetic input into a more realistic representation.

Web Images

Web Image with i_start=1
Web Image with \( i_{\text{start}} = 1 \)
Web Image with i_start=3
Web Image with \( i_{\text{start}} = 3 \)
Web Image with i_start=5
Web Image with \( i_{\text{start}} = 5 \)
Web Image with i_start=7
Web Image with \( i_{\text{start}} = 7 \)
Web Image with i_start=10
Web Image with \( i_{\text{start}} = 10 \)
Web Image with i_start=20
Web Image with \( i_{\text{start}} = 20 \)
Original Web Image
Original Web Image

Paint 1: Flower

Paint 1 with i_start=1
Paint 1 with \( i_{\text{start}} = 1 \)
Paint 1 with i_start=3
Paint 1 with \( i_{\text{start}} = 3 \)
Paint 1 with i_start=5
Paint 1 with \( i_{\text{start}} = 5 \)
Paint 1 with i_start=7
Paint 1 with \( i_{\text{start}} = 7 \)
Paint 1 with i_start=10
Paint 1 with \( i_{\text{start}} = 10 \)
Paint 1 with i_start=20
Paint 1 with \( i_{\text{start}} = 20 \)
Original Paint 1
Original Paint 1

Paint 2: Twitter

Paint 2 with i_start=1
Paint 2 with \( i_{\text{start}} = 1 \)
Paint 2 with i_start=3
Paint 2 with \( i_{\text{start}} = 3 \)
Paint 2 with i_start=5
Paint 2 with \( i_{\text{start}} = 5 \)
Paint 2 with i_start=7
Paint 2 with \( i_{\text{start}} = 7 \)
Paint 2 with i_start=10
Paint 2 with \( i_{\text{start}} = 10 \)
Paint 2 with i_start=20
Paint 2 with \( i_{\text{start}} = 20 \)
Original Paint 2
Original Paint 2

Part 1.7.2: Inpainting

Given an original image \( x_{\text{orig}} \) and a binary mask \( \mathbf{m} \), the model creates a new image that retains the original content where \( \mathbf{m} = 0 \), but generates new content where \( \mathbf{m} = 1 \).

We run the diffusion denoising loop, but at each step, after obtaining \( x_t \), we "force" \( x_t \) to match the original image \( x_{\text{orig}} \) wherever \( \mathbf{m} = 0 \). Mathematically, this is represented as:

\( x_t \leftarrow \mathbf{m} x_t + (1 - \mathbf{m}) \text{forward}(x_{\text{orig}}, t) \)

Essentially, everything inside the edit mask \( \mathbf{m} \) is updated by the diffusion process, while everything outside the mask remains consistent with the original image, with the correct amount of noise added for the current timestep \( t \).
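The change inside the denoising loop is small; here is a sketch, assuming mask is a binary tensor that is 1 where new content should be generated, x_orig is the original image, and forward is the noising function from Part 1.1:

```python
# Inside the iterative denoising loop, after computing x_t for the current timestep t:
x_orig_noised, _ = forward(x_orig, t, alphas_cumprod)   # original image with the right noise level
x_t = mask * x_t + (1 - mask) * x_orig_noised           # keep generated content only inside the mask
```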

Test Image: Campanile

Original Image
Original Image
Edit Mask
Edit Mask
Noised Region to Replace
Noised Region to Replace
Final Inpainting Result
Final Inpainting Result

My Choice: Bird

Original Image
Original Image
Edit Mask
Edit Mask
Noised Region to Replace
Noised Region to Replace
Final Inpainting Result
Final Inpainting Result

My Choice: Beach

Original Image
Original Image
Edit Mask
Edit Mask
Noised Region to Replace
Noised Region to Replace
Final Inpainting Result
Final Inpainting Result

Part 1.7.3: Text-Conditional Image-to-Image Translation

In this section, we extend image-to-image translation by incorporating a text prompt to control the generated content. The text prompt guides the translation process, allowing for more specific and targeted modifications.

Test Image: Campanile
Text Prompt: "a rocket ship"

Test Image at t=1
Test Image at Noise Level 1
Test Image at t=3
Test Image at Noise Level 3
Test Image at t=5
Test Image at Noise Level 5
Test Image at t=7
Test Image at Noise Level 7
Test Image at t=10
Test Image at Noise Level 10
Test Image at t=20
Test Image at Noise Level 20
Original Test Image
Original Test Image
Edit Mask
Edit Mask
Replaced Content
Replaced Content

Test Image: Beach
Text Prompt: "a man wearing a hat"

Test Image 1 at t=1
Choice Image 1 at Noise Level 1
Test Image 1 at t=3
Choice Image 1 at Noise Level 3
Test Image 1 at t=5
Choice Image 1 at Noise Level 5
Test Image 1 at t=7
Choice Image 1 at Noise Level 7
Test Image 1 at t=10
Choice Image 1 at Noise Level 10
Test Image 1 at t=20
Choice Image 1 at Noise Level 20
Original Test Image 1
Original Choice Image 1
Edit Mask for Test Image 1
Edit Mask
Replaced Content for Test Image 1
Replaced Content

Test Image: Man
Text Prompt: "a rocket ship"

Test Image 2 at t=1
Choice Image 2 at Noise Level 1
Test Image 2 at t=3
Choice Image 2 at Noise Level 3
Test Image 2 at t=5
Choice Image 2 at Noise Level 5
Test Image 2 at t=7
Choice Image 2 at Noise Level 7
Test Image 2 at t=10
Choice Image 2 at Noise Level 10
Test Image 2 at t=20
Choice Image 2 at Noise Level 20
Original Test Image 2
Original Choice Image 2
Edit Mask for Test Image 2
Edit Mask
Replaced Content for Test Image 2
Replaced Content

Part 1.8: Visual Anagrams

In this section, we create optical illusions with diffusion models by using a clever combination of transformations and denoising steps.

\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \\ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\ \epsilon = (\epsilon_1 + \epsilon_2) / 2 \]

First, we denoise the image \( x_t \) at step \( t \) normally with prompt 1 to obtain the noise estimate \( \epsilon_1 \).
Then we flip \( x_t \) upside down and denoise it with prompt 2 to get the noise estimate \( \epsilon_2 \).
We flip \( \epsilon_2 \) back so it is right-side up and average the two noise estimates.
Finally, we perform a reverse/denoising diffusion step with the averaged noise estimate (see the sketch below).
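A sketch of the anagram noise estimate, assuming estimate_noise(x_t, t, embeds) wraps the U-Net (with CFG already applied) and p1_embeds, p2_embeds are the embeddings of the two prompts:

```python
import torch

def anagram_noise(x_t, t, estimate_noise, p1_embeds, p2_embeds):
    """Average a normal noise estimate (prompt 1) with an upside-down one (prompt 2)."""
    eps_1 = estimate_noise(x_t, t, p1_embeds)
    # Flip the image vertically, estimate noise under prompt 2, then flip the estimate back.
    eps_2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, p2_embeds), dims=[-2])
    return (eps_1 + eps_2) / 2
```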

Visual Anagram 1 - Image 1
"an oil painting of people around a campfire"
Visual Anagram 1 - Image 2
"an oil painting of an old man"
Visual Anagram 2 - Image 2
"a photo of the amalfi cost"
Visual Anagram 2 - Image 1
"a photo of a man"
Visual Anagram 3 - Image 2
"an oil painting of a snowy mountain village"
Visual Anagram 3 - Image 1
"a man wearing a hat"

Part 1.9: Hybrid Images

To create hybrid images with a diffusion model, we build a composite noise estimate \( \epsilon \) by estimating the noise with two different text prompts and then combining the low frequencies of one estimate with the high frequencies of the other.

\( \epsilon_1 = \text{UNet}(x_t, t, p_1) \)
\( \epsilon_2 = \text{UNet}(x_t, t, p_2) \)
\( \epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \)

Here, UNet is the diffusion model's U-Net, \( f_{\text{lowpass}} \) is a low-pass filter, \( f_{\text{highpass}} \) is a high-pass filter, and \( p_1 \), \( p_2 \) are two different text prompt embeddings. The final noise estimate is \( \epsilon \).
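A sketch of the composite estimate, using a Gaussian blur as the low-pass filter (kernel size 33, sigma 2, as noted below) and its complement as the high-pass; estimate_noise and the prompt embeddings are assumed as before:

```python
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise(x_t, t, estimate_noise, p1_embeds, p2_embeds):
    """Low frequencies from prompt 1's noise estimate, high frequencies from prompt 2's."""
    eps_1 = estimate_noise(x_t, t, p1_embeds)
    eps_2 = estimate_noise(x_t, t, p2_embeds)
    low = gaussian_blur(eps_1, kernel_size=33, sigma=2.0)            # low-pass
    high = eps_2 - gaussian_blur(eps_2, kernel_size=33, sigma=2.0)   # high-pass (complement)
    return low + high
```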

Hybrid Image 1
Hybrid Image 1
Hybrid Image 2
Hybrid Image 2
Hybrid Image 3
Hybrid Image 3

We used a Gaussian blur with kernel size 33 and sigma 2 for the low-pass filtering.

Hybrid Image 1: looks like a skull from far away but a waterfall from close up.
Hybrid Image 2: looks like an old man from far away but people around a campfire from close up.
Hybrid Image 3: looks like a dog from far away but a waterfall from close up.

Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

UNet and Operations

We build a simple one-step denoiser: given a noisy image \( z \), we train a denoiser \( D_\theta \) to map \( z \) back to the clean image \( x \). To do so, we optimize an \( L_2 \) loss:

\( L = \mathbb{E}_{z,x} \| D_\theta(z) - x \|^2 \)

Unconditional U-Net Structure
Unconditional U-Net Structure
Standard U-Net Operations
Standard U-Net Operations

Training Data Pairs

To train our denoiser, we need to generate training data pairs of \((z, x)\), where each \(x\) is a clean MNIST digit. For each training batch, we can generate \(z\) from \(x\) using the following noising process:

\( z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \).
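A sketch of generating a (z, x) pair on the fly for one batch and computing the loss from Part 1; the specific noise level sigma and the denoiser model are assumptions here, not fixed by the text above:

```python
import torch
import torch.nn.functional as F

def denoiser_loss(denoiser, x, sigma=0.5):
    """Noise a clean batch x and compute the L2 loss of the denoiser's reconstruction."""
    epsilon = torch.randn_like(x)
    z = x + sigma * epsilon            # noising process: z = x + sigma * eps
    return F.mse_loss(denoiser(z), x)  # L2 loss against the clean digits
```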

Here is what I used for the training data pairs:

MNIST Noise Levels
Varying levels of noise on MNIST digits

Training

Training Result

1 Epoch

Results on digits from the test set after 1 epoch of training

5 Epochs

Results on digits from the test set after 5 epochs of training

Out-of-Distribution Testing

Results on digits from the test set with varying noise levels.

Part 2: Training a Diffusion Model

The forward process for generating noisy images is defined as:

\( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \), where \( \epsilon \sim \mathcal{N}(0, \mathbf{I}) \).

The training objective for denoising is to minimize the L2 loss:

\( L = \mathbb{E}_{\epsilon, x_0, t} \| \epsilon_\theta(x_t, t) - \epsilon \|^2 \).
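A sketch of one training step under this objective, assuming alphas_cumprod is the precomputed \( \bar{\alpha} \) schedule and unet(x_t, t) is the time-conditioned UNet; the timestep count T and the normalization of t are assumptions:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(unet, x0, alphas_cumprod, T=300):
    """Noise a clean batch to a random timestep and train the UNet to predict the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # random timestep per image
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps   # forward process
    eps_hat = unet(x_t, t.float() / T)                           # t normalized to [0, 1)
    return F.mse_loss(eps_hat, eps)
```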

Time-Conditioned UNet and FCBlock

Conditioned UNet
FCBlock for conditioning
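For reference, a minimal sketch of an FCBlock under a common Linear → GELU → Linear design; the exact layer sizes and where its output modulates the UNet features are assumptions here:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps a conditioning signal (e.g. the normalized timestep t)
    to a vector used to modulate a UNet feature map."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, out_channels),
            nn.GELU(),
            nn.Linear(out_channels, out_channels),
        )

    def forward(self, c):
        # c: (B, in_channels) conditioning vector; broadcast over H and W downstream.
        return self.net(c)
```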

Training UNet

Training time-conditioned UNet

Sampling from UNet

Sampling from time-conditioned UNet
Sampling results for the time-conditioned UNet for 5 Epochs
Sampling results for the time-conditioned UNet for 20 Epochs

Class-Conditioned UNet

Training class-conditioned UNet

Sampling from Class-Conditioned UNet

Sampling from class-conditioned UNet
Sampling results for the class-conditioned UNet for 5 Epochs
Sampling results for the class-conditioned UNet for 20 Epochs

What I Learned

I have used diffusion models many times but never had the chance to implement one from scratch. This was a very valuable experience for me. A big shout-out to all the staff who developed this project—it's an awesome project, and I enjoyed it so much!