This section introduces the DeepFloyd IF diffusion models used for generating images. The models include a stage 1 U-Net that generates 64x64 images and a stage 2 U-Net that upscales them to 256x256. Below are images generated using different numbers of denoising iterations.
In this section, I progressively added noise to a test image at different timesteps. I did this by implementing the forward process x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, where eps ~ N(0, 1). The results below illustrate the effect of increasing noise levels.
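As a minimal sketch of this forward process (assuming a precomputed tensor of cumulative alpha products, here called alphas_cumprod):

import torch

def forward(im, t, alphas_cumprod):
    # Noise a clean image to timestep t: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                        # eps ~ N(0, I)
    x_t = torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps
    return x_t, eps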
In this section, I attempt to denoise the noisy images using Gaussian blurring at various noise levels. The results below show the limitations of this classical approach to denoising.
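A rough sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma below are illustrative, not the exact values used):

import torchvision.transforms.functional as TF

def blur_denoise(x_t, kernel_size=7, sigma=2.0):
    # Naive "denoising": smooth the noisy image with a Gaussian filter
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)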
In this section, a pre-trained U-Net was used for one-step denoising. I estimate the noise in the noisy image by passing it through stage_1.unet, then subtract that estimated noise to obtain an estimate of the original image. Here are the results for different levels of noise.
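A sketch of the one-step estimate, reusing the names above; the [:, :3] slice assumes the IF U-Net outputs predicted-variance channels alongside the noise estimate:

def one_step_denoise(x_t, t, prompt_embeds, stage_1, alphas_cumprod):
    with torch.no_grad():
        noise_pred = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    noise_pred = noise_pred[:, :3]                    # keep only the noise-estimate channels
    a_bar = alphas_cumprod[t]
    # Invert the forward equation to recover an estimate of x_0
    return (x_t - torch.sqrt(1 - a_bar) * noise_pred) / torch.sqrt(a_bar)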
In this section, I implemented iterative denoising, in which I use strided timesteps to iteratively denoise the image. This gives better results than one-step denoising by removing the noise gradually over many steps.
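One strided update looks roughly like this (variable names are my own; the noise term added from the predicted variance is omitted for brevity):

def iterative_denoise_step(x_t, t, t_prev, x_0_est, alphas_cumprod):
    # t is the current (noisier) timestep, t_prev is the next, smaller strided timestep
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev
    beta = 1 - alpha
    x_prev = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar_t)) * x_0_est \
           + (torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_prev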
In this section, images were generated from pure random noise using iterative denoising: starting from a completely noisy image, the noise is gradually removed to produce a sampled image. Here are the sampled results.
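In code this is just the same loop started from random noise; iterative_denoise below stands in for a wrapper around the per-step update sketched above:

x = torch.randn(1, 3, 64, 64)              # pure noise at the stage-1 resolution
sample = iterative_denoise(x, i_start=0)   # i_start = 0 starts at the noisiest strided timestep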
In this section, I implement Classifier-Free Guidance, where images are generated using a combination of conditional and unconditional noise estimates. To get the unconditional estimate, I pass an empty prompt embedding to the diffusion model; the conditional estimate uses the embedding of 'a high quality photo'. Setting the guidance weight (gamma) to 7 produces noticeably higher-quality images.
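A sketch of the CFG noise estimate, continuing the assumed names from above:

def cfg_noise_estimate(x_t, t, cond_embeds, uncond_embeds, stage_1, gamma=7.0):
    # cond_embeds encodes "a high quality photo"; uncond_embeds encodes the empty prompt
    with torch.no_grad():
        eps_cond = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_uncond = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    # Push the estimate beyond the conditional one by the guidance weight gamma
    return eps_uncond + gamma * (eps_cond - eps_uncond)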
In this section, I implemented image-to-image translation. To do so, I take an original image, add noise to it, and then force it back onto the natural image manifold without any conditioning. This generates an image that is similar to the test image, with the similarity controlled by how much noise is added.
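A sketch of the procedure, assuming the forward and iterative_denoise helpers above and a list of strided timesteps:

x_noisy, _ = forward(test_im, timesteps[i_start], alphas_cumprod)   # partially noise the original
edited = iterative_denoise(x_noisy, i_start=i_start)                # denoise back onto the manifold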
This section explores editing hand-drawn and web-sourced images. The process involves feeding these non-realistic images into the model to attempt to bring them onto the natural image manifold.
In this section, I implement inpainting, in which I mask part of an image and then fill in the masked region using the diffusion model. Below are results of inpainting applied to iconic landmarks like the Campanile, Eiffel Tower, and Leaning Tower of Pisa.
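The key step is to keep everything outside the mask pinned to the original image after each denoising step; a sketch (mask = 1 where new content is generated):

def inpaint_step(x_t, t, x_orig, mask, alphas_cumprod):
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)   # re-noise the original to timestep t
    return mask * x_t + (1 - mask) * x_orig_t          # generate inside the mask, keep the rest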
In this section, I implemented text-conditioned image-to-image translation, which combines the original image with a text prompt so that language guides the diffusion process. The results below demonstrate transformations applied to landmarks based on specific text descriptions.
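In practice this is the same loop as image-to-image translation, just with a different prompt embedding; the prompt text and the embed helper below are stand-ins:

prompt_embeds = embed("a photo of a rocket ship")     # stand-in for the text-encoder call
edited = iterative_denoise(x_noisy, i_start=i_start, prompt_embeds=prompt_embeds)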
In this section, I created visual anagrams by using diffusion models to mix and overlay features from two distinct prompts. To achieve this, I denoise the image at a given timestep normally with one prompt to obtain a noise estimate. At the same timestep, I also flip the image upside down and denoise it with a different prompt to get a second noise estimate. I flip this second estimate back, average the two noise estimates, and perform the reverse diffusion step with the averaged noise to create the anagram.
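A sketch of the combined noise estimate for one step (argument names assumed; images are (B, C, H, W), so flipping dim 2 turns them upside down):

def anagram_noise_estimate(x_t, t, embeds_1, embeds_2, stage_1):
    with torch.no_grad():
        eps_1 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
        eps_2 = stage_1.unet(torch.flip(x_t, dims=[2]), t,
                             encoder_hidden_states=embeds_2).sample[:, :3]
    eps_2 = torch.flip(eps_2, dims=[2])    # flip the second estimate back upright
    return (eps_1 + eps_2) / 2             # average and use this in the reverse step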
I created hybrid images by estimating the noise with two different text prompts, and then combining the low frequencies from one noise estimate with the high frequencies of the other. This generates a hybrid image that looks different depending on the distance it is viewed from.
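A sketch of the frequency split, using a Gaussian blur as the low-pass filter (the blur parameters are illustrative):

def hybrid_noise_estimate(x_t, t, embeds_low, embeds_high, stage_1, k=33, sigma=2.0):
    with torch.no_grad():
        eps_1 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_low).sample[:, :3]
        eps_2 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_high).sample[:, :3]
    low = TF.gaussian_blur(eps_1, kernel_size=k, sigma=sigma)           # low frequencies of prompt 1
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=k, sigma=sigma)  # high frequencies of prompt 2
    return low + high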
In this section, I implemented and trained a diffusion model on the MNIST dataset. This involved building and training a U-Net for iterative denoising and exploring time-conditioning and class-conditioning.
In this part, I train a U-Net to denoise noisy MNIST images. To accomplish this, I add noise to the digits and train the U-Net on pairs of noisy and original images so that the network learns to denoise them. For training, I set the noise level to sigma = 0.5. After training, I tested the performance on out-of-distribution images, meaning images with noise levels other than sigma = 0.5.
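The training loop is a straightforward regression; unet, dataloader, and optimizer below are placeholders for the project's own objects:

import torch.nn.functional as F

sigma = 0.5
for x, _ in dataloader:                    # MNIST digits; labels are unused here
    z = x + sigma * torch.randn_like(x)    # make a noisy/clean training pair
    loss = F.mse_loss(unet(z), x)          # L2 loss against the clean digit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()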
In this part, the U-Net predicts the noise in an image instead of performing the denoising itself, since iterative denoising works better than one-step denoising. To do this, I condition the U-Net on the timestep of the diffusion process. The training workflow is to noise each image with a random timestep from 0 to 299, pass it through the U-Net to get the predicted noise, and compute the loss between the predicted noise and the noise that was actually added. To sample from the model, I use the iterative denoising algorithm: start from an image of pure noise and iteratively move toward a clean image. The pseudocode for the training and sampling algorithms, along with training results, is shown below.
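A rough Python sketch of that training loop (normalising t by T before feeding it to the U-Net is my assumption about the conditioning interface):

T = 300
for x, _ in dataloader:
    t = torch.randint(0, T, (x.shape[0],))              # random timestep per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps
    eps_pred = unet(x_t, t.float() / T)                  # time-conditioned noise prediction
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad(); loss.backward(); optimizer.step()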
In this section, I add class-conditioning to the U-Net in order to improve the quality of the generated images. I now use the MNIST training labels as one-hot encoded vectors and feed them into the U-Net as an additional input to achieve class-conditioning. To sample, I pass in the one-hot encoded vector corresponding to the digit I want to generate. Pseudocode for the training and sampling algorithms for the class-conditioned model is shown below along with training results.
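Sampling a chosen digit then looks roughly like this (the U-Net call signature and ddpm_step are assumptions standing in for the reverse update):

c = F.one_hot(torch.tensor([3]), num_classes=10).float()   # ask the model for a "3"
x = torch.randn(1, 1, 28, 28)
for t in reversed(range(T)):
    t_in = torch.tensor([t]).float() / T
    eps = unet(x, t_in, c)             # class- and time-conditioned noise estimate
    x = ddpm_step(x, eps, t)           # placeholder for the DDPM reverse-step formula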