In recent years, the hottest topic in deep learning has been Generative Adversarial Networks (GANs). Proposed by Ian Goodfellow in 2014, the original GAN is the ancestor of the many GAN variants that have appeared over the past four years. The figure below shows the number of GAN papers published per month over those four years: between the proposal in 2014 and 2016 there were few related papers, but from 2016, or at least from 2017 to this year, related papers have shown truly explosive growth.
So what exactly is a GAN, and why has it become such a hot research area in recent years?
A GAN, or Generative Adversarial Network, is a generative model. It can also be used as a semi-supervised or unsupervised learning model, able to learn deep representations without large amounts of annotated data. Its defining feature is the proposal to train two deep networks against each other.
Machine learning can currently be divided into three types according to whether the dataset is labeled: supervised, semi-supervised, and unsupervised learning. Supervised learning is the most mature and currently works best, but as datasets need to grow larger and larger, obtaining labels becomes more and more expensive, so more and more researchers hope for better progress in unsupervised learning. GANs fit this need well: first, they do not require large amounts of annotated data, or even any labels at all; second, they can do many things. Current applications include image synthesis, image editing, style transfer, image super-resolution, and image-to-image translation.
Take font conversion as an example: the zi2zi project demonstrates the transformation of Chinese fonts, with results shown below. A GAN can learn different fonts and then convert between them.

Beyond font learning, there is also image-to-image translation, which pix2pix can do. The results are shown in the figure below: segmentation maps become realistic photos, black-and-white images become color, and line drawings gain rich texture, shadows, and gloss. These are all results achieved by pix2pix.

CycleGAN can achieve style transfer, with results shown in the figure below: real photos become Impressionist paintings, ordinary horses are swapped with zebras, seasons change, and so on.
Those are some application examples of GANs. Next, I will briefly introduce the principle behind GANs and their advantages and disadvantages, including the reason more and more GAN-related papers have been published since the idea was proposed.
1. The basic principle
The idea behind a GAN is very simple: a game between a generator network and a discriminator network.

A GAN consists mainly of two networks, a generator (Generator) and a discriminator (Discriminator). Through the game between the two networks, the generator eventually learns the distribution of the input data, which is exactly what a GAN aims to achieve. Its basic structure is shown in the figure below, from which the respective roles of G and D can be understood:

- D is the discriminator. It judges whether its input is real data or fake data generated by G. Its output is 0 or 1, so it is essentially a binary classifier: the goal is to output 1 for real data and 0 for fake data.
- G is the generator. It receives random noise and generates images from it.

During training, G's goal is to generate data realistic enough to confuse D, while D's goal is to tell apart the images G generates. In this way the two play a game against each other, and the ultimate goal is to reach a balance: a Nash equilibrium.
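The game described above corresponds to the minimax value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]. As a minimal sketch (a toy calculation, not any official implementation), the snippet below evaluates this value and shows that a perfectly confused discriminator, one that outputs 0.5 for every sample, yields the equilibrium value -2 log 2:

```python
import math

def gan_value(d_real, d_fake):
    """Value of the GAN minimax game:
    V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# At the Nash equilibrium the discriminator is perfectly confused and
# outputs 0.5 for every sample, real or fake.
d_real = [0.5] * 100
d_fake = [0.5] * 100
v = gan_value(d_real, d_fake)
print(v)  # -2 * log(2), roughly -1.3863
```

A discriminator that separates real from fake better (e.g., outputs 0.9 on real, 0.1 on fake) pushes this value up, which is exactly what D tries to do and G tries to prevent.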
2. Advantages and disadvantages

(The advantages and disadvantages below mainly come from Ian Goodfellow's answer on Quora and answers on Zhihu.)

Advantages:
- GANs are trained with backpropagation alone; no Markov chains are needed
- No inference over latent variables is required during training
- In theory, any differentiable function can be used to build D and G, so they can be combined with deep neural networks to form deep generative models
- G's parameter updates do not come directly from data samples, but from gradients backpropagated through D
- Compared with other generative models (VAEs, Boltzmann machines), GANs can generate better samples
- GANs can be used for semi-supervised learning and do not need much labeled data in the training set
- There is no need to design the model around any particular factorization; any generator and discriminator will work
Disadvantages:

- Poor interpretability: the distribution P_g(G) learned by the generator has no explicit expression
- Training is hard: D and G must be kept well synchronized, for example updating D k times for every update of G
- Training a GAN means reaching a Nash equilibrium, which gradient descent sometimes achieves and sometimes does not. No reliable way to reach the Nash equilibrium has been found yet, so training a GAN is less stable than training a VAE or PixelRNN, although in practice it is arguably still more stable than training a Boltzmann machine
- GANs struggle to learn to generate discrete data, such as text
- Compared with Boltzmann machines, GANs cannot easily guess one pixel value from another: a GAN is built to generate all pixels at once. BiGAN mitigates this, letting you use Gibbs sampling to fill in missing values the way a Boltzmann machine does
- Training is unstable: it is hard to make G and D converge
- Training also suffers from vanishing gradients and mode collapse
- There is a lack of effective, direct, observable ways to evaluate the quality of the generated samples
3.1 Why vanishing gradients and mode collapse occur during training
The essence of a GAN is that G and D play a game and finally reach a Nash equilibrium, but that is only the ideal situation. What normally happens is that one side becomes strong while the other stays weak, and once this imbalance forms, if no way is found to restore balance in time, problems follow. Vanishing gradients and mode collapse are the two outcomes of this situation, corresponding to D and G being the stronger side, respectively.

First, vanishing gradients: the better D is, the more G's gradient vanishes, because G's gradient updates come from D. At the start of training, G's input is randomly generated noise, which certainly will not produce good images, so D can easily tell real samples from fake ones. That means D's training loss is nearly zero, and no effective gradient information flows back to let G optimize itself. This phenomenon is called the vanishing-gradient problem.

Second, mode collapse: this mainly happens when G is relatively strong, so that D cannot distinguish real images from the fakes G generates. Even when G still cannot generate sufficiently realistic images, D fails to tell and gives a positive evaluation anyway; G then concludes its output is correct and keeps producing the same image or small set of images, and D keeps approving them. The two deceive each other, so G ends up outputting only a few fixed images. The result is that the generated images are not realistic and lack diversity.
A more detailed explanation can be found in the article Amazing Wasserstein GAN, which analyzes the problems of the original GAN in depth; they stem mainly from the loss function.
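One key point of that analysis can be reproduced in a few lines: with an optimal discriminator, the original GAN loss reduces (up to constants) to the JS divergence between the real and generated distributions, and the JS divergence between two non-overlapping distributions is always log 2, no matter how far apart they are, so it tells G nothing about which direction to move. A toy sketch over discrete point masses (an illustrative setup, not code from the article):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(i, n=10):
    """A distribution putting all its mass on cell i of an n-cell line."""
    return [1.0 if j == i else 0.0 for j in range(n)]

# No matter how far apart the non-overlapping supports are,
# the JS divergence is the same constant, log(2):
for d in (1, 5, 9):
    print(js(point_mass(0), point_mass(d)))
```

Because the divergence is flat in the distance between the two distributions, its gradient with respect to G's parameters is zero almost everywhere; this is the motivation for replacing it with the Wasserstein distance.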
3.2 Why GANs are not suited to text data
- Text data is discrete, unlike image data. For text, each word is usually mapped to a high-dimensional vector, and the final prediction is a one-hot vector. Suppose the softmax output is (0.2, 0.3, 0.1, 0.2, 0.15, 0.05); the one-hot vector becomes (0, 1, 0, 0, 0, 0). If the softmax output is instead (0.2, 0.25, 0.2, 0.1, 0.15, 0.1), the one-hot vector is still (0, 1, 0, 0, 0, 0). So even when G outputs different results, D gives the same judgment, gradient update information cannot be passed back to G effectively, and the discrimination D finally outputs is meaningless.
- The loss function of a GAN is based on the JS divergence, which is not suitable for measuring the distance between distributions that do not overlap. (Although WGAN replaces the JS divergence with the Wasserstein distance, its ability to generate text is still limited. A GAN application for generating text is SeqGAN, which combines reinforcement learning.)
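The first point above can be checked directly: two different softmax outputs collapse to the same one-hot vector under argmax, so the discriminator sees no difference at all. A minimal sketch:

```python
def one_hot_argmax(probs):
    """Collapse a softmax output to a one-hot vector via argmax (the
    non-differentiable step that blocks gradient flow for discrete text)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]

p1 = [0.2, 0.3, 0.1, 0.2, 0.15, 0.05]
p2 = [0.2, 0.25, 0.2, 0.1, 0.15, 0.1]
print(one_hot_argmax(p1))  # [0, 1, 0, 0, 0, 0]
print(one_hot_argmax(p2))  # [0, 1, 0, 0, 0, 0] -- G changed, D sees no change
```

For images the situation is different: a small change in G's parameters produces a small change in pixel values, which D's output responds to continuously.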
3.3 Why SGD is not commonly used as the optimizer for GANs

- SGD oscillates easily, which makes GAN training even more unstable
- The purpose of a GAN is to find a Nash equilibrium, and the Nash equilibrium of a GAN is a saddle point. SGD, however, only finds local minima: SGD solves a minimization problem, whereas a GAN is a game problem.
For saddle points, Baidu Encyclopedia gives this explanation:

A saddle point (Saddle point): in differential equations, a singular point that is stable in one direction and unstable in the other is called a saddle point. In a functional, a critical point that is neither a maximum nor a minimum is called a saddle point. In a matrix, an entry that is the maximum of its row and the minimum of its column is called a saddle point. More broadly, in physics, it refers to a point that is a maximum in one direction and a minimum in another.

The difference between a saddle point and a local minimum or local maximum is shown in the figure below:
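The difference between minimization and game problems can be seen on the toy two-player game f(x, y) = x * y, where x tries to minimize and y tries to maximize; the unique Nash equilibrium is the saddle point (0, 0). Simultaneous plain gradient steps (a hypothetical toy stand-in for SGD on each player) do not converge to it; they spiral outward:

```python
import math

# Two-player game f(x, y) = x * y: x descends its gradient, y ascends its
# gradient. The Nash equilibrium is the saddle point (0, 0), yet the
# iterates move strictly away from it at every step.
x, y, lr = 1.0, 1.0, 0.1
radii = []
for _ in range(100):
    gx, gy = y, x                    # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy  # simultaneous gradient steps
    radii.append(math.hypot(x, y))   # distance from the saddle point

print(radii[0], radii[-1])  # the distance keeps growing: no convergence
```

Each step multiplies the distance from the equilibrium by sqrt(1 + lr^2) > 1, so the dynamics diverge regardless of the step size, which is why game-aware optimizers and heuristics matter for GANs.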
4. Training tips

These training tips mainly come from Tips and tricks to make GANs work.
1. Normalize the inputs

- Normalize the inputs to between -1 and 1
- Use tanh as the last layer of the generator's output
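As a minimal sketch of this tip (plain lists standing in for image tensors): map 8-bit pixel values into [-1, 1], the range of a tanh output layer, and invert the mapping when saving generated images:

```python
def normalize(img):
    """Map 8-bit pixel values in [0, 255] to [-1, 1],
    matching the range of a tanh generator output."""
    return [p / 127.5 - 1.0 for p in img]

def denormalize(img):
    """Map tanh-range values in [-1, 1] back to [0, 255]."""
    return [round((p + 1.0) * 127.5) for p in img]

pixels = [0, 64, 128, 255]
norm = normalize(pixels)
print(norm)               # values in [-1, 1]
print(denormalize(norm))  # recovers the original pixel values
```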
2. Use a modified loss function

In the original GAN paper, the loss function for G is min log(1 - D(G(z))), but in practice max log D(G(z)) is used instead. The reason the author gives is that the former leads to vanishing gradients.

In fact, even this loss function proposed by the author is still problematic, namely mode collapse. Many later GAN papers improve on this problem; for example, WGAN proposes a new loss function.
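The vanishing-gradient claim is easy to verify numerically. A sketch in terms of the discriminator score D on a fake sample (ignoring the chain rule through G's parameters, which only rescales both values):

```python
# Gradients of the two generator losses with respect to D's score on a
# fake sample. Early in training, D confidently rejects fakes, so D is
# close to 0.
def grad_saturating(d):      # L = log(1 - D)  ->  dL/dD = -1 / (1 - D)
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):  # L = -log(D)     ->  dL/dD = -1 / D
    return -1.0 / d

d = 0.001  # D is nearly certain the sample is fake
print(abs(grad_saturating(d)))      # close to 1: almost no signal for G
print(abs(grad_non_saturating(d)))  # close to 1000: a strong gradient for G
```

The original min log(1 - D) loss saturates exactly when G most needs feedback, while the max log D variant gives its largest gradients in that regime.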
3. Sample the noise from a sphere

- Do not sample from a uniform distribution
- Sample the random noise from a Gaussian distribution
- When interpolating, do it along a great circle rather than linearly from point A to point B, as shown in the figure below
- For more details, see Tom White's paper Sampling Generative Networks and the code at https://github.com/dribnet/plat
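Great-circle (spherical) interpolation can be sketched in pure Python; this follows the standard slerp formula, not code taken from the linked repository:

```python
import math

def slerp(t, a, b):
    """Spherical interpolation between noise vectors a and b, t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    # Angle between the two vectors, clamped for numerical safety.
    omega = math.acos(max(-1.0, min(1.0, dot / (na * nb))))
    so = math.sin(omega)
    if so < 1e-8:  # nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    return [
        (math.sin((1 - t) * omega) * x + math.sin(t * omega) * y) / so
        for x, y in zip(a, b)
    ]

a, b = [1.0, 0.0], [0.0, 1.0]
mid = slerp(0.5, a, b)
print(mid)  # stays on the unit circle, unlike the lerp midpoint [0.5, 0.5]
```

The lerp midpoint of two unit noise vectors has norm about 0.71, i.e., it falls inside the sphere where the Gaussian prior puts little mass; slerp keeps the interpolant at a typical norm.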
4. BatchNorm

- Construct separate mini-batches for real and fake data: each mini-batch should contain only real images or only generated images
- When not using BatchNorm, you can use instance normalization (a normalization operation applied to each sample individually)
- You can also use virtual batch normalization: before training starts, predefine a reference batch R, and for each new batch X, compute the normalization statistics from R together with X
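Instance normalization is simple to sketch: statistics are computed per sample (and, for images, per channel) rather than across the whole batch. A toy version on flat feature vectors:

```python
def instance_norm(batch, eps=1e-5):
    """Normalize each sample's features independently (instance
    normalization), rather than across the batch as BatchNorm does."""
    out = []
    for sample in batch:
        n = len(sample)
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / n
        out.append([(x - mean) / (var + eps) ** 0.5 for x in sample])
    return out

# Two samples on very different scales; each is normalized on its own.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normed = instance_norm(batch)
print(normed)  # each row is centered on 0 with (near-)unit variance
```

Because no statistic is shared across the batch, the result does not depend on which real or fake samples happen to share a mini-batch.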
5. Avoid sparse gradients: ReLU, MaxPool

- Sparse gradients affect the stability of GAN training
- Use the LeakyReLU activation function instead of ReLU in both G and D
- For downsampling, use average pooling (Average Pooling) or Conv2d + stride instead
- For upsampling, use PixelShuffle (https://arxiv.org/abs/1609.05158) or ConvTranspose2d + stride
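The two dense-gradient replacements can be sketched in a few lines (1-D toy versions, purely illustrative):

```python
def leaky_relu(x, alpha=0.2):
    """LeakyReLU keeps a small slope for negative inputs, so the gradient
    is never exactly zero (unlike ReLU, which zeroes it)."""
    return x if x > 0 else alpha * x

def avg_pool_1d(xs, k=2):
    """Average pooling: every input contributes to the output, so the
    gradient stays dense (max pooling routes gradient to one input only)."""
    return [sum(xs[i:i + k]) / k for i in range(0, len(xs) - k + 1, k)]

print(leaky_relu(-1.0))           # -0.2 (ReLU would output 0, no gradient)
print(avg_pool_1d([1, 3, 2, 4]))  # [2.0, 3.0]
```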
6. Use labels

- Label smoothing. With two target labels, say the real-image label is 1 and the generated-image label is 0, then for each real input example, use a random number between 0.7 and 1.2 as the label instead of 1. One-sided label smoothing is generally used.
- When training D, occasionally flip the labels
- If you have labeled data, use the labels
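A sketch of the first two label tricks together: smoothed "real" targets in the [0.7, 1.2] range the tip quotes, with occasional label flips (the 5% flip probability here is an illustrative choice, not a value from the source):

```python
import random

def smoothed_real_labels(n, rng, low=0.7, high=1.2, flip_p=0.05):
    """One-sided label smoothing for D's 'real' targets, with occasional
    flips to 'fake'. flip_p is a hypothetical example value."""
    labels = []
    for _ in range(n):
        if rng.random() < flip_p:          # occasionally flip real -> fake
            labels.append(0.0)
        else:
            labels.append(rng.uniform(low, high))
    return labels

rng = random.Random(0)  # seeded for reproducibility
labels = smoothed_real_labels(8, rng)
print(labels)  # mostly values in [0.7, 1.2], occasionally 0.0
```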
7. Use the Adam optimizer

8. Track failures early

- If D's loss becomes 0, training has failed
- Check the norms of the gradients: if they exceed 100, something is wrong
- When training is going well, D's loss has low variance and decreases over time
- If G's loss steadily decreases, it is fooling D with garbage samples
9. Don't balance the losses via statistics

10. Add noise to the inputs

- Add artificial noise to D's inputs
- Add Gaussian noise to every layer of G
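A sketch of instance noise on the discriminator's inputs (the standard deviation, and the convention of annealing it toward zero over training, are illustrative choices here):

```python
import random

def add_input_noise(inputs, std, rng):
    """Add zero-mean Gaussian noise to D's inputs; std is typically
    annealed toward 0 as training progresses."""
    return [x + rng.gauss(0.0, std) for x in inputs]

rng = random.Random(42)  # seeded for reproducibility
clean = [0.0] * 5
noisy = add_input_noise(clean, std=0.1, rng=rng)
print(noisy)  # the same input, perturbed by small Gaussian noise
```

The noise effectively widens both the real and generated distributions, giving them overlapping support so that D's feedback stays informative.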
11. Discrete variables in Conditional GANs

- Use an Embedding layer
- Add the embedding to the input image as an extra channel
- Keep the embedding dimensionality low, and upsample it to match the image channel size
12. Use Dropout in G in both the training and test phases

- Provide noise in the form of dropout (50% probability)
- Apply it to several layers of G, at both training and test time
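A sketch of standard (inverted) dropout as the noise source this tip describes; unlike the usual practice, it is deliberately left active at test time:

```python
import random

def dropout(xs, p, rng):
    """Inverted dropout: zero each unit with probability p, scale survivors
    by 1/(1-p). Per the tip, applied in several layers of G at BOTH
    training and test time, acting as a source of noise."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]

rng = random.Random(1)  # seeded for reproducibility
h = [1.0] * 10          # a toy hidden activation vector in G
out = dropout(h, p=0.5, rng=rng)
print(out)  # roughly half the units zeroed, survivors scaled by 2
```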
References:

- Goodfellow et al., "Generative Adversarial Networks". NIPS 2014.
- GAN series learning (1): the past and present of GANs
- An accessible, in-depth explanation of GAN principles (complete text version)
- Amazing Wasserstein GAN
- What advantages do generative adversarial networks (GANs) have over traditional training methods? (Zhihu)
- Tips and tricks to make GANs work
Note: the images come from the Internet and the reference articles.

That is the main content and summary of this article. Feel free to leave a comment with your suggestions and opinions.

You are also welcome to follow my WeChat official account, Machine Learning and Computer Vision, or scan the QR code below, to share your suggestions and opinions with me, point out possible mistakes in the article, and chat, learn, and improve together!