This part belongs to the following [Converge] series:
[Converge] Gradient Descent - Several solvers
[Converge] Weight Initialiser
[Converge] Backpropagation Algorithm [BP implementation details]
[Converge] Feature Selection in training of Deep Learning [the effect of feature correlation]
[Converge] Training Neural Networks [cs231n lec 5 & 6, recommended]
[Converge] Batch Normalisation
Also needing attention: the influence of weight initialization on gradient explosion and gradient vanishing.

The paper is a little old; I feel there is no need to dwell on it at present.
This section is mainly notes on the reference paper On Optimization Methods for Deep Learning, which compares
    • SGD (stochastic gradient descent),
    • L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm), and
    • CG (conjugate gradient method),
three common optimization algorithms, in terms of their performance in deep learning systems.
Dropout was proposed by Hinton, in his paper Improving neural networks by preventing co-adaptation of feature detectors.
Nothing much to say about it; if it works on the ensembling principle, it is better to use it than not.
I don't think the methods above are practical; at present, Newton-style methods work very well.
At present, the mainstream method for training the weights of deep nets is gradient descent (combined with the BP algorithm); before that, unsupervised methods (for instance RBM, autoencoder) can of course be used to pre-train the weights.
    • One disadvantage of gradient descent in deep networks is that the per-iteration weight changes are very small, so it easily converges to a local optimum;
    • another disadvantage is that gradient descent cannot handle error functions with ill-conditioned curvature (such as the Rosenbrock function).
The Hessian-Free method (hereafter HF) discussed in this article does not need to pre-train the network weights, achieves decent results, has a wider range of application (it can be used to train RNNs and similar networks), and at the same time overcomes the two disadvantages of gradient descent above.
The main idea of HF is similar to Newton's method, except that it never explicitly computes the Hessian matrix H of the error surface at a given point; instead, a technique computes the product Hv of H with an arbitrary vector v (this matrix-vector product is what the later optimization steps need), hence the name "Hessian-Free".
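HF in practice computes an exact Hv with the R-operator (Pearlmutter's trick) inside conjugate gradient; as a rough illustration that Hv can be obtained without ever forming H, here is a finite-difference sketch on a toy quadratic (all names and sizes are illustrative, not from the paper):

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H @ v via finite differences of the gradient:
    Hv ~ (grad(w + eps*v) - grad(w)) / eps -- no explicit Hessian needed."""
    return (grad_fn(w + eps * v) - grad_fn(w)) / eps

# Toy example: f(w) = 0.5 * w^T A w, whose gradient is A w and Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda w: A @ w

w = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
hv = hessian_vector_product(grad_fn, w, v)
print(hv)  # close to A @ v = [3.5, 4.5]
```

Since f here is quadratic, the finite difference is exact up to floating-point error; for a real network loss, the R-operator version is preferred because it avoids the step-size sensitivity of eps.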

Convolutional neural network structure changes: Maxout Networks, Network In Network, Global Average Pooling

Take this opportunity to understand the relevant concepts.

Reference material
[1] Maxout Networks
[2] Deep learning: Forty-five (a simple understanding of maxout)
[3] Paper notes: "Maxout Networks" && "Network In Network"
[4] Deep learning: Twenty-six (Network In Network learning notes)
[5] Fully convolutional networks for semantic segmentation
[6] Network in Network
[7] Improving neural networks by preventing co-adaptation of feature detectors


1. Maxout Networks

The paper puts forward the idea that a linear transformation followed by a max operation can fit any convex function, including common activation functions (such as ReLU).


If the activation function is the sigmoid, then in the forward pass the output of the i-th hidden-layer node is

    h_i(x) = sigmoid(x^T W_{...i} + b_i)

Here W is 2-dimensional; the subscript means the i-th column is taken out (corresponding to the i-th output node), and the ellipsis before the subscript i stands for all the rows in column i.


If the maxout activation function is used instead, the output of a hidden-layer node becomes

    h_i(x) = max_{j in [1,k]} z_{ij},  with  z_{ij} = x^T W_{...ij} + b_{ij}

Here W is 3-dimensional, of size d*m*k, where

  • d is the number of input-layer nodes,
  • m is the number of hidden-layer nodes,
  • k is the number of intermediate nodes that each hidden-layer node is expanded into; all k intermediate nodes are linear outputs, and each maxout node takes the maximum over the outputs of its k intermediate nodes.
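A minimal NumPy sketch of this maxout forward pass (the d, m, k sizes here are arbitrary illustration values):

```python
import numpy as np

def maxout_forward(x, W, b):
    """Maxout hidden layer.
    x: (d,) input; W: (d, m, k) weights; b: (m, k) biases.
    Each of the m hidden nodes takes the max over its k linear pieces."""
    z = np.einsum('d,dmk->mk', x, W) + b   # k linear outputs per hidden node
    return z.max(axis=1)                   # (m,) maxout activations

d, m, k = 4, 3, 5
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W = rng.standard_normal((d, m, k))
b = rng.standard_normal((m, k))
h = maxout_forward(x, W, b)
print(h.shape)  # (3,)
```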

For reference, one page from a Japanese maxout slide deck is as follows:

The meaning of this figure is that the hidden node in the purple circle is expanded into 5 yellow nodes, and the max is taken. Maxout's fitting ability is very strong: it can fit any convex function.

From left to right, they fit ReLU, abs, and a quadratic curve, respectively.

The authors also prove this conclusion mathematically: just 2 maxout nodes (taking their difference) can fit any continuous function, provided the number of intermediate nodes can be arbitrarily large, as shown in the figure below; you can read the details in paper [1].

A strong assumption of maxout is that the output lies in a convex set of the input space... Does this assumption hold? Although ReLU is a special case of maxout, in practice we can hardly learn exactly the ReLU case; what we learn is this nonlinear transformation, built as a combination of multiple linear transformations plus a max operation.

Jeff: Does it have real practical value? Is it still worth using? It doesn't feel like a significant improvement; just get a general understanding of it.

2. Network In Network

Several concepts from this paper, including 1*1 convolution and global average pooling, have since become standard elements of network design; the paper has a unique point of view.

Look at the first NIN: originally, an 11*11*3*96 convolution (11*11 kernels, 96 output maps) outputs 96 points for one patch, i.e., the 96 channels at the same pixel of the output feature maps. Now an extra MLP layer is added: these 96 points are fully connected, and 96 points are output again.

The clever part is that this new MLP layer is equivalent to a 1*1 convolution layer.

This makes designing the network structure very convenient: just add a 1*1 convolution layer after the original convolution layer, without changing the output size.

Be careful: every convolution layer is followed by a ReLU. So it is as if the network has become deeper; my understanding is that this extra depth is in fact the main factor improving the results.

Its significance: the layer acts as a combined effect of different feature extractors, saving network parameters while keeping the accuracy, so simplifying the network this way makes sense.

[Explained with examples; see the original paper]

This builds up an important concept: a fully connected layer can be converted into a 1*1 convolution. This idea will be useful in many later networks, such as FCN[5].
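The equivalence can be checked numerically; a small NumPy sketch (the 96-channel sizes just mirror the NIN example above):

```python
import numpy as np

# A 1x1 convolution applies the same linear map independently at every
# pixel, so it is equivalent to a fully connected layer over the channels.
rng = np.random.default_rng(0)
H, W_, C_in, C_out = 6, 6, 96, 96
fmap = rng.standard_normal((H, W_, C_in))   # feature map (HWC layout)
W_fc = rng.standard_normal((C_in, C_out))   # "MLP" weights over channels

# Fully connected over channels, applied at each pixel:
out_fc = (fmap.reshape(-1, C_in) @ W_fc).reshape(H, W_, C_out)

# The same thing written as a 1x1 convolution (einsum over channels):
out_conv = np.einsum('hwc,cd->hwd', fmap, W_fc)

print(np.allclose(out_fc, out_conv))  # True
```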

3. Global Average Pooling

In the GoogLeNet network, Global Average Pooling is also used; it was actually inspired by Network In Network.

Global Average Pooling is usually used at the end of the network to replace the fully connected (FC) layers. Why replace FC? Because in practice, networks such as AlexNet and VGG connect FC layers in series between the convolutions and the softmax, and these have some shortcomings:

(1) The number of parameters is very large; sometimes over 80~90% of a network's parameters are in the last few FC layers;
(2) They easily overfit: much of the overfitting of CNNs comes from the final FC layers, because they have too many parameters but no suitable regularizer; overfitting weakens the generalization ability of the model;
(3) A very important point in practice that the paper does not mention: FC requires fixed input and output sizes, i.e., the image must be of a given size, but in reality images come in different sizes, so FC is inconvenient.

The authors propose Global Average Pooling. It is easy to do: for each individual feature map, take the average over the whole map. Make the number of output nodes equal to the number of classification categories, so a softmax can be connected directly afterwards.

The authors point out the benefits of Global Average Pooling:

  • Because it forces the number of final feature maps to equal the number of categories, the feature maps can be interpreted as category confidence maps.
  • It has no parameters, so it does not overfit;
  • It averages over the whole plane, making use of the spatial information, so it is more robust to spatial changes in the image.

For instance:

Suppose the data at the last layer consists of 10 feature maps of size 6*6. Global average pooling computes the average of all pixels in each feature map and outputs one value,

so the 10 feature maps output 10 values; put together, these form a 1*10 vector, which becomes the feature vector that can be fed into the softmax for classification.
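A NumPy sketch of exactly this example (10 feature maps of size 6*6):

```python
import numpy as np

# 10 feature maps of size 6x6 -> global average pooling -> a 1x10 vector.
rng = np.random.default_rng(0)
fmaps = rng.standard_normal((10, 6, 6))

gap = fmaps.mean(axis=(1, 2))   # one average per feature map
print(gap.shape)                # (10,)

# A softmax over the 10 values then gives the class probabilities directly.
probs = np.exp(gap - gap.max())
probs /= probs.sum()
print(probs.sum())              # sums to 1 (up to float rounding)
```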


In mid-2016, researchers at MIT demonstrated that CNNs with GAP layers (a.k.a. GAP-CNNs) that have been trained for a classification task can also be used for object localization.

That is, a GAP-CNN not only tells us what object is contained in the image - it also tells us where the object is in the image, with no additional work on our part! The localization is expressed as a heat map (referred to as a class activation map), where the color-coding scheme identifies regions that are relatively important for the GAP-CNN to perform the object identification task.

Like maxout (see the simple understanding of maxout above), DropConnect was also published at ICML 2013; it likewise aims to improve the generalization ability of deep networks, and both claim to be improvements over Dropout.
    • Unlike Dropout, it does not randomly zero the outputs of hidden-layer nodes;
    • instead, each input weight connected to a node is zeroed with probability 1-p.
According to the authors, both Dropout and DropConnect amount to model averaging: Dropout averages 2^|m| models, while DropConnect averages 2^|M| models (m is a vector, M is a matrix, and |·| denotes the number of elements in the vector or matrix).
In this respect, DropConnect's model-averaging capacity is stronger (because |M| > |m|).
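A toy NumPy sketch of the difference (sizes are arbitrary; p is the keep probability, so masks zero things with probability 1-p):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # keep probability
x = rng.standard_normal(4)                # layer input
W = rng.standard_normal((4, 3))           # weight matrix
act = np.tanh

# Dropout: zero each hidden *output* with probability 1 - p.
m = (rng.random(3) < p).astype(float)     # mask over |m| = 3 outputs
h_dropout = m * act(x @ W)

# DropConnect: zero each *weight* with probability 1 - p.
M = (rng.random(W.shape) < p).astype(float)  # mask over |M| = 12 weights
h_dropconnect = act(x @ (M * W))

# 2^|m| possible output masks vs 2^|M| possible weight masks.
print(2 ** m.size, 2 ** M.size)           # 8 4096
```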
Jeff: An insignificant improvement; it has no clear advantage.
The stochastic pooling method is very simple: the elements in a feature map are randomly sampled according to their probability values, i.e., elements with larger values are more likely to be selected. It is not like max-pooling, which always takes only the maximum element.
Jeff: This paper feels like it is mostly filler.
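The mechanism is easy to sketch in NumPy all the same (a hypothetical single 2*2 window; activations assumed non-negative, as after a ReLU):

```python
import numpy as np

def stochastic_pool(window, rng):
    """Pick one activation from the window with probability proportional
    to its value (activations assumed non-negative, e.g. after ReLU)."""
    a = window.ravel()
    total = a.sum()
    if total == 0:                 # all-zero window: nothing to sample
        return 0.0
    return rng.choice(a, p=a / total)

rng = np.random.default_rng(0)
window = np.array([[0.0, 1.0],
                   [2.0, 5.0]])    # max-pooling would always pick 5
samples = [stochastic_pool(window, rng) for _ in range(1000)]
print(np.mean(np.array(samples) == 5.0))  # roughly 5/8 = 0.625
```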
Still needing additional attention: the influence of weight initialization on gradient explosion and gradient vanishing.
I need to learn more about this.
