Part 0: Setup
We start with some setup: we load the DeepFloyd IF diffusion model, run it on 3 text prompts, and display each caption alongside the model's output. The first set of images uses 20 inference steps.
Part 1: Sampling Loops
1.1 Implementing the Forward Process
The first step is to implement the forward process using the equation that was given to us, specifically:
\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 1) \]
which, given a clean image \( x_0 \), produces a noisy image \( x_t \) at timestep \( t \) by sampling from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} x_0 \) and variance \( 1 - \bar{\alpha}_t \). We implement the forward(im, t) function, and the results below show the test image at noise levels [0, 250, 500, 750].
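As a reference, here is a minimal sketch of forward(im, t), assuming alphas_cumprod holds the scheduler's cumulative \( \bar{\alpha}_t \) values (the same tensor referenced in section 1.4):

```python
import torch

# A minimal sketch of the forward process; `alphas_cumprod` is assumed to be
# the scheduler's tensor of cumulative products \bar{alpha}_t.
def forward(im, t):
    """Noise a clean image `im` to produce x_t at timestep `t`."""
    alpha_bar = alphas_cumprod[t]     # scalar \bar{alpha}_t
    eps = torch.randn_like(im)        # epsilon ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```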
1.2 Classical Denoising
Now that we have the noisy images, we want to try to denoise them. The first thing we can try is Gaussian blur filtering (we used a kernel size of 13 with sigma of 2) to remove the noise. The results below show the output of applying the Gaussian blur at each timestep [250, 500, 750].
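A sketch of this baseline using torchvision's gaussian_blur:

```python
import torchvision.transforms.functional as TF

# Classical baseline: Gaussian blur with kernel size 13 and sigma 2.
def blur_denoise(noisy_im):
    return TF.gaussian_blur(noisy_im, kernel_size=13, sigma=2.0)
```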
As we can see, this does not look good at all, so we will try another method instead.
1.3 One-Step Denoising
We will now use the pretrained diffusion model to denoise the image: the model estimates the Gaussian noise in the image, which we can then remove to recover an estimate of the clean image. The images below show the 3 noisy images at timesteps [250, 500, 750], along with the one-step denoised result for each.
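The clean-image estimate comes from inverting the forward-process equation. A minimal sketch, assuming stage_1.unet is the DeepFloyd stage-1 UNet, prompt_embeds is the embedding of "a high quality photo", and the predicted noise sits in the first three output channels (all of these names are assumptions):

```python
def one_step_denoise(x_t, t):
    with torch.no_grad():
        out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]  # DeepFloyd also predicts a variance; keep the noise half
    alpha_bar = alphas_cumprod[t]
    # Invert x_t = sqrt(ab) x_0 + sqrt(1 - ab) eps to estimate the clean image
    return (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
```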
Again, this does not look great; however, it is still better than the Gaussian blur we tried earlier. In the next section we will do iterative denoising.
1.4 Iterative Denoising
Instead of doing one-step denoising, we can do a much better job by denoising step by step and obtaining the clean image at the end. To do this we first create a list of timesteps that we call strided_timesteps, where the first item in the list corresponds to the noisiest image and the last timestep corresponds to a clean image; we used strided_timesteps from T = 990 down to T = 0 in steps of 30. Then, for each timestep, we apply the following formula, implemented in the function iterative_denoise(image, i_start):
\[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma \]
Where:
- \( x_t \) is your image at timestep \( t \)
- \( x_{t'} \) is your noisy image at timestep \( t' \) where \( t' < t \) (less noisy)
- \( \bar{\alpha}_t \) is defined by alphas_cumprod, as explained above
- \( \alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}} \)
- \( \beta_t = 1 - \alpha_t \)
- \( x_0 \) is our current estimate of the clean image using equation 2 just like in section 1.3
- \( v_\sigma \) is random noise, which in the case of DeepFloyd is also predicted.
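Putting these pieces together, here is a minimal sketch of one update step, assuming eps and v_sigma come from the UNet's prediction and alphas_cumprod is indexable by timestep:

```python
# One update of iterative denoising: move from timestep t to the less noisy
# timestep t' (the next entry of strided_timesteps).
def denoise_step(x_t, t, t_prime, eps, v_sigma):
    ab_t, ab_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = ab_t / ab_tp
    beta_t = 1 - alpha_t
    # Current clean-image estimate, as in section 1.3
    x0 = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
    return (ab_tp.sqrt() * beta_t / (1 - ab_t)) * x0 \
         + (alpha_t.sqrt() * (1 - ab_tp) / (1 - ab_t)) * x_t \
         + v_sigma
```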
We started with i_start = 10. The images below show the noisy image at every 5th loop of denoising, along with the final predicted clean image using iterative denoising, the predicted clean image using a single denoising step, and the predicted clean image using Gaussian blurring.
1.5 Diffusion Model Sampling
Once we have everything working, we can sample images completely from noise by setting i_start to 0. The images below show 5 sampled images.
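Sampling from scratch is just iterative denoising started from pure noise; a sketch, assuming DeepFloyd's 64x64 stage-1 resolution:

```python
# Draw 5 samples by denoising pure Gaussian noise from the very first timestep.
samples = [
    iterative_denoise(torch.randn(1, 3, 64, 64, device="cuda"), i_start=0)
    for _ in range(5)
]
```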
1.6 Classifier-Free Guidance
One thing we note here is that the generated images are not very good. We can do better by applying classifier-free guidance: we compute both a conditional and an unconditional noise estimate, and combine them into the final noise estimate below, where \( \epsilon_{c} \) denotes the conditional noise estimate and \( \epsilon_{u} \) the unconditional one:
\[ \epsilon = \epsilon_{u} + \gamma (\epsilon_{c} - \epsilon_{u}) \]
We implemented the iterative_denoise_cfg function with a scale factor of \( \gamma = 7 \); 5 sampled images are shown below.
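A sketch of how the two UNet calls combine, under the same assumptions as the one-step sketch above (cond_embeds and uncond_embeds are hypothetical names for the prompt and empty-prompt embeddings):

```python
gamma = 7  # CFG scale factor
eps_c = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
eps_u = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
eps = eps_u + gamma * (eps_c - eps_u)  # guided noise estimate
```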
1.7 Image-to-Image Translation
In the previous part we took a real image, added noise to it, and then denoised it. In this part we instead take the original image, noise it a little, and force it back onto the image manifold without any conditioning. We use the iterative_denoise_cfg function with starting indices [1, 3, 5, 7, 10, 20] and show the results.
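A sketch of this edit loop, reusing the forward and iterative_denoise_cfg functions from above:

```python
# Noise the original image to each starting level, then force it back onto
# the image manifold with iterative denoising.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(original_im, t)  # noise the image to level i_start
    edited = iterative_denoise_cfg(x_t, i_start=i_start)
```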
1.7.1 Editing Hand-Drawn and Web Images
Now we apply the same procedure to web images and also to hand-drawn images; the images below show the results.
1.7.2 Inpainting
We then apply the above process to inpainting: given an original image \( x_{orig} \) and a binary mask \( m \), we can create a new image that keeps the original content wherever the mask is 0 and generates new content wherever the mask is 1. This is done using the normal diffusion steps we did before, except that after every step we also apply \[ x_t \leftarrow m x_{t} + (1 - m)\, \text{forward}(x_{orig}, t) \] The images below show the inpainting results for the 3 images.
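A sketch of the per-step correction:

```python
# After each denoising update, re-impose the known pixels (mask == 0) from
# the original image, noised to the current timestep.
def inpaint_step(x_t, t, m, x_orig):
    return m * x_t + (1 - m) * forward(x_orig, t)
```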
1.7.3 Text-Conditional Image-to-Image Translation
The next thing we can do is, instead of using the text prompt "a high quality photo", change the text prompt so that the image is generated from that prompt and looks more similar to the original as we increase the noise level. The images below show the results: for the first image we used the text prompt "a rocket ship", for the second "robot head", and for the third "a strawberry".
1.8 Visual Anagrams
Now we can do something very cool: we can use our diffusion model to make visual anagrams, images that show one thing when viewed normally and another thing when flipped upside down. This is done as follows: at step \( t \) we denoise the image \( x_{t} \) with the first prompt to obtain the noise estimate \( \epsilon_{1} \); we then flip \( x_{t} \) and denoise it with the second prompt to obtain \( \epsilon_{2} \); finally we flip \( \epsilon_{2} \) back and average the two noise estimates: \[ \begin{align} \epsilon_{1} &= \text{UNet}(x_{t}, t, \text{prompt}_1) \\ \epsilon_{2} &= \text{flip}(\text{UNet}(\text{flip}(x_{t}), t, \text{prompt}_2)) \\ \epsilon &= \frac{\epsilon_{1} + \epsilon_{2}}{2} \end{align} \]
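A sketch of this, where noise_estimate(x, t, p) is a hypothetical helper wrapping the (CFG) UNet call for prompt p:

```python
# Visual anagram noise estimate: flip the image, estimate noise under the
# second prompt, flip the estimate back, and average with the first estimate.
eps_1 = noise_estimate(x_t, t, prompt_1)
eps_2 = torch.flip(noise_estimate(torch.flip(x_t, dims=[-2]), t, prompt_2),
                   dims=[-2])
eps = (eps_1 + eps_2) / 2
```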
1.9 Hybrid Images
The last thing we can do is create hybrid images, just like in project 2. To do this we do something very similar to the previous part: \[ \begin{align} \epsilon_{1} &= \text{UNet}(x_{t}, t, \text{prompt}_1) \\ \epsilon_{2} &= \text{UNet}(x_{t}, t, \text{prompt}_2) \\ \epsilon &= f_{\text{lowpass}}(\epsilon_{1}) + f_{\text{highpass}}(\epsilon_{2}) \end{align} \] where we use a Gaussian filter with kernel size 33 and sigma 2 as the low-pass filter and its complement as the high-pass filter. The images below show some of the resulting hybrid images. We got the inspiration for the captions from here. For the first set of images we used the prompts "a lithograph of a skull" and "a lithograph of waterfalls"; for the second set, "a lithograph of a pig" and "a lithograph of waterfalls"; and for the third set, "a lithograph of a panda" and "a lithograph of flowers".
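A sketch of the hybrid noise estimate, reusing the hypothetical noise_estimate helper from the anagram sketch and torchvision's gaussian_blur for the filters:

```python
import torchvision.transforms.functional as TF

# Low-pass one prompt's noise estimate with a Gaussian blur (kernel 33,
# sigma 2) and combine it with the high-pass residual of the other's.
eps_1 = noise_estimate(x_t, t, prompt_1)
eps_2 = noise_estimate(x_t, t, prompt_2)
low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2.0)
high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2.0)
eps = low + high
```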
Part B: Diffusion Models from Scratch!
1.2 Using the UNet to Train a Denoiser
Now that we have all the operations needed to implement the UNet, we can use it to train a denoiser \( D_{\theta} \) that maps a noisy image \( z \) back to the clean image \( x \). We do this with the L2 loss \( L = \mathbb{E}\,\| D_{\theta}(z) - x \|^2 \), where \[ z = x + \sigma \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I) \] The images below show the image at the different noise levels \( \sigma = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 \).
1.2.1 Training
We then train our UNet on the dataset for 5 epochs with a batch size of 256, a hidden size of 128, and a learning rate of 1e-4. We use \( \sigma = 0.5 \) and record the training loss for every mini-batch (i.e., every 256 images) in every epoch. The images below show the training loss curve, along with some denoising samples after epoch 1 and epoch 5.
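A sketch of the training loop under these hyperparameters; UNet and train_loader are assumed names for the denoiser class and the batch-size-256 data loader:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = UNet(hidden_dim=128).to(device)   # hidden size of 128 (name assumed)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()                    # L2 loss against the clean image

for epoch in range(5):
    for x, _ in train_loader:
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)  # noisy input, sigma = 0.5
        loss = loss_fn(model(z), x)
        opt.zero_grad()
        loss.backward()
        opt.step()
```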
1.2.2 Out-of-Distribution Testing
Once we are done with training, we can use our model for out-of-distribution testing: we use the model trained with \( \sigma = 0.5 \) to denoise images at noise levels 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
Part 2
2.2 Training the UNet and 2.3 Sampling from the UNet
Once we have defined the new FCBlock, and since one-step denoising might not be the best, we will try to denoise the image iteratively. The first thing to do is add the time condition and use it to train our model. Since I finished this before the spec got changed, my hyperparameters are the same as in the previous part (batch size of 256, hidden size of 128, and learning rate of 1e-4), and the loss shown here is again the training loss for every mini-batch (every 256 images) in every epoch.
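A minimal sketch of an FCBlock for injecting the normalized timestep into the UNet; the exact layer composition here is an assumption:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small fully-connected block that maps the conditioning signal
    (e.g. t / T) to a feature vector added inside the UNet."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        # t: (batch, in_ch) tensor holding the normalized timestep
        return self.net(t)
```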