1. Introduction
CLIP is robust to shifts in image distribution and supports zero-shot transfer; diffusion models offer high sample diversity together with good fidelity. DALL-E 2 combines the strengths of both models.
2. Method
The figure above (from the paper) shows the overall pipeline. Above the dotted line is CLIP, which is pretrained in advance and never updated during DALL-E 2 training; its weights are frozen. DALL-E 2 is trained on pairs of data, a text and its corresponding image. The text first goes through CLIP's text encoding module (a BERT-style Transformer encoder; for images CLIP uses a ViT). CLIP is fundamentally contrastive learning, so the encodings of the two modalities are what matter: after both modalities are encoded, cosine similarity is computed between the embeddings directly (a toy sketch of this follows at the end of this section). The image embedding produced by CLIP's image encoder serves as the ground truth.

The text embedding is fed into the first model, the prior, which here is a diffusion model (an autoregressive Transformer can also be used). The prior outputs an image embedding that is supervised by the image embedding produced by CLIP, so it is in effect a supervised model (see the schematic training step below). The prior is followed by the decoder module. In the original DALL-E, the encoder and decoder were trained together as a dVAE; here the decoder is trained separately and is itself a diffusion model.

Below the dotted line, then, the generative model turns a single complete generation step into explicit two-stage image generation, and the authors run experiments on this explicit scheme. The paper calls itself unCLIP: CLIP converts input text and images into features, while DALL-E 2 runs the reverse process, converting text features into image features and then into images; the image-feature-to-image step is realized by a diffusion model (a pipeline sketch appears below).

The decoder uses both classifier-free guidance and CLIP guidance. "Guidance" here refers to the decoding process: the input is a noisy image at timestep t, and the final output is a clean image. Each feature map the U-Net produces along the way can be scored by an image classifier, typically with a cross-entropy loss for classification; the gradient of that classification loss with respect to the noisy image can then be used to steer the diffusion toward better decoding (both guidance variants are sketched below).
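To make the contrastive step above concrete, here is a minimal PyTorch sketch of CLIP-style similarity scoring. It assumes the two embeddings already come out of pretrained text and image encoders; the function names and the temperature value are illustrative, not the real CLIP API.

```python
import torch
import torch.nn.functional as F

def clip_similarity(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # L2-normalize both embeddings so the dot product equals cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # (n_text, dim) @ (dim, n_image) -> pairwise cosine similarity matrix
    return text_emb @ image_emb.T

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Matched pairs sit on the diagonal; the symmetric cross-entropy pushes
    # their similarity up and mismatched pairs down.
    logits = clip_similarity(text_emb, image_emb) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```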
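The prior's supervision can be sketched as below. `prior` is a hypothetical nn.Module mapping a CLIP text embedding to a predicted image embedding, trained against the CLIP image embedding (the ground truth mentioned above). The paper's prior is a diffusion model (or an autoregressive Transformer), so this plain MSE step is a simplification of the real objective.

```python
import torch
import torch.nn.functional as F

def prior_training_step(prior, optimizer, text_emb, image_emb_gt):
    pred_image_emb = prior(text_emb)  # text feature -> predicted image feature
    loss = F.mse_loss(pred_image_emb, image_emb_gt)  # supervised by CLIP's image embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```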
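The explicit two-stage generation path can then be written end to end. `clip_text_encoder`, `prior`, and `decoder` are hypothetical callables standing in for the frozen CLIP text encoder, the trained prior, and the trained diffusion decoder.

```python
import torch

@torch.no_grad()
def unclip_generate(text_tokens, clip_text_encoder, prior, decoder):
    text_emb = clip_text_encoder(text_tokens)  # frozen CLIP: text -> text feature
    image_emb = prior(text_emb)                # stage 1: text feature -> image feature
    image = decoder(image_emb)                 # stage 2: image feature -> pixels (diffusion)
    return image
```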
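Finally, sketches of the two guidance signals used in the decoder. Both assume a U-Net `eps_model(x_t, t, cond)` that predicts noise and a noisy-image classifier `classifier(x_t, t)`; the names, signatures, and guidance scale are assumptions for illustration, not the paper's code.

```python
import torch

def classifier_free_guidance(eps_model, x_t, t, cond, scale=3.0):
    # Run the model with and without conditioning, then extrapolate toward
    # the conditional prediction.
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def classifier_guidance_grad(classifier, x_t, t, target_class):
    # Gradient of log p(y | x_t) w.r.t. the noisy image: the cross-entropy
    # gradient mentioned above, used to nudge each denoising step toward
    # images the classifier assigns to `target_class`.
    x_t = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
    idx = torch.arange(x_t.size(0))
    selected = log_probs[idx, target_class].sum()
    return torch.autograd.grad(selected, x_t)[0]
```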
Thanks.