Parameter initialization is a very important aspect of neural networks, and of deep learning algorithms in general. The traditional approach draws the initial parameters at random from a Gaussian distribution, or even sets them all directly to 1 or 0. This is crude and direct, but the results are often mediocre. This article is based on a discussion thread from an overseas forum, elaborated with my own understanding.

First, let's think about why the choice of initial parameters matters so much in neural network algorithms (to keep things simple, we consider the most basic DNN). Take the sigmoid function (logistic neurons) as an example. As the absolute value of x grows, the function flattens out and saturates, and its derivative tends to 0. For example, at x=2 the derivative is about 1/10, while at x=10 it is only about 1/22000. In other words, when the input to the activation function is 10, the network learns roughly 2,200 times more slowly than when the input is 2 (using these rounded values).

To make the network learn faster, we want the derivative of the sigmoid activation to be reasonably large. Numerically, this corresponds to sigmoid inputs roughly in the range [-4, 4], as the shape of the sigmoid curve shows. Of course, it does not have to be that precise. We know that the input to a neuron j is the weighted sum of the outputs of the neurons in the previous layer, $x_j = \sum_i a_i \cdot w_i + b_j$. We can therefore control the range of the initial weights so that the neuron's input falls within the range we need.
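The derivative values quoted above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not taken from the original post.

```python
import math

def sigmoid(x):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, computed as s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_2 = sigmoid_grad(2.0)    # close to 1/10
grad_at_10 = sigmoid_grad(10.0)  # close to 1/22000
print(grad_at_2, grad_at_10, grad_at_2 / grad_at_10)
```

The printed ratio is a few thousand, in line with the rough figure given in the text.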

## A relatively simple and effective method

Initialize the weights by sampling uniformly at random from the interval

$$\left(-\frac{1}{\sqrt{d}},\ \frac{1}{\sqrt{d}}\right),$$

where $d$ is the number of inputs to a neuron.

To see why this choice is reasonable, let's briefly review some basic facts:

1. For a random variable X following the uniform distribution U(a, b), the expectation is E(X) = (a+b)/2 and the variance is Var(X) = (b−a)²/12.

2. If the random variables X and Y are independent, then Var(X+Y) = Var(X) + Var(Y). If X and Y are independent and both have mean 0, then Var(X·Y) = Var(X)·Var(Y).

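Therefore, if we assume the neuron's inputs $x_i$ have mean 0 and standard deviation 1 and are independent of the weights, then

$$\mathrm{Var}(w_i) = \frac{(2/\sqrt{d})^2}{12} = \frac{1}{3d}$$

$$\mathrm{Var}\left(\sum_{i=1}^{d} w_i x_i\right) = d \cdot \mathrm{Var}(w_i) = \frac{1}{3}$$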

In other words, the weighted sum of d random input signals, with weights drawn uniformly from $\left(-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\right)$, has mean 0 and variance 1/3 regardless of d, and for large d it is approximately normally distributed. The standard deviation is about 0.58, so the probability that the neuron's input falls outside the interval [-4, 4] is very small.
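The variance result can be checked empirically. The sketch below is a Monte Carlo estimate under stated assumptions: the inputs are drawn from a standard normal distribution (the text only requires mean 0 and standard deviation 1), and the function name `preactivation_variance` and the trial count are illustrative choices.

```python
import numpy as np

def preactivation_variance(d, n_trials=4000, seed=0):
    """Estimate Var(sum_i w_i * x_i) for a neuron with d inputs."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(d)
    # Weights ~ U(-1/sqrt(d), 1/sqrt(d)), redrawn for every trial.
    w = rng.uniform(-bound, bound, size=(n_trials, d))
    # Inputs with mean 0 and standard deviation 1 (Gaussian is an assumption).
    x = rng.standard_normal(size=(n_trials, d))
    z = np.sum(w * x, axis=1)  # one weighted sum per trial
    return z.var()

for d in (10, 100, 500):
    print(d, preactivation_variance(d))
```

Each estimate should land near 1/3 whatever the value of d, which is the point of scaling the interval by $1/\sqrt{d}$.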

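A more general form can be written as follows, where $\langle \cdot \rangle$ denotes expectation, $\sigma^2$ is the variance of each zero-mean weight, and the inputs have unit variance:

$$\left\langle \sum_{i=1}^{d} w_i x_i \right\rangle = \sum_{i=1}^{d} \langle w_i \rangle \langle x_i \rangle = 0$$

$$\left\langle \left(\sum_{i=1}^{d} w_i x_i\right)\left(\sum_{i=1}^{d} w_i x_i\right) \right\rangle = \sum_{i=1}^{d} \langle w_i^2 \rangle \langle x_i^2 \rangle = \sigma^2 d$$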

## Another, more recent initialization method

According to Glorot & Bengio (2010), initialize the weights uniformly within the interval $[-b, b]$, where

$$b = \sqrt{\frac{6}{H_k + H_{k+1}}},$$

and $H_k$ and $H_{k+1}$ are the sizes of the layers before and after the weight matrix. This form applies to hyperbolic tangent units. For sigmoid units, sample from Uniform $[-b, b]$ with

$$b = 4\sqrt{\frac{6}{H_k + H_{k+1}}}.$$
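The interval formula can be turned into a small sampling routine. This is a minimal sketch, not a reference implementation: the function name `glorot_uniform`, the `gain` parameter, and the layer sizes 256 and 128 are assumptions introduced for illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, gain=1.0, seed=None):
    """Sample a (fan_in, fan_out) weight matrix from U(-b, b),
    with b = gain * sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    b = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-b, b, size=(fan_in, fan_out))

# Hypothetical layer sizes: H_k = 256 units feeding H_{k+1} = 128 units.
W = glorot_uniform(256, 128, seed=0)             # b = sqrt(6 / 384) = 0.125
W4 = glorot_uniform(256, 128, gain=4.0, seed=0)  # the factor-of-4 variant
print(W.shape, np.abs(W).max(), np.abs(W4).max())
```

Keeping the factor of 4 as a `gain` argument lets the same routine cover both variants of the bound.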

## Initialization methods for other scenarios

• In the case of RBMs, a zero-mean Gaussian with a small standard deviation, around 0.1 or 0.01, works well for initializing the weights (Hinton, 2010).

• Orthogonal random matrix initialization, i.e. `W = np.random.randn(ndim, ndim); u, s, v = np.linalg.svd(W)`, then use `u` as your initialization matrix.
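The orthogonal initialization in the last bullet can be written out as a small function. This sketch follows the SVD recipe quoted above; the wrapper name `orthogonal_init`, the seed handling, and the size 64 are illustrative additions.

```python
import numpy as np

def orthogonal_init(ndim, seed=None):
    """Return an (ndim, ndim) orthogonal matrix obtained from the SVD
    of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((ndim, ndim))
    # np.linalg.svd returns U, the singular values, and V transposed.
    u, s, vt = np.linalg.svd(W)
    return u  # columns of u are orthonormal

Q = orthogonal_init(64, seed=0)
# Q^T Q should equal the identity up to floating-point error.
print(np.allclose(Q.T @ Q, np.eye(64)))
```

Because the matrix is orthogonal, multiplying by it preserves the norm of a vector, which is the usual motivation for this scheme.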

## References

Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.

LeCun, Y., Bottou, L., Orr, G. B., and Muller, K. "Efficient backprop." Neural Networks: Tricks of the Trade. 1998.

Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International Conference on Artificial Intelligence and Statistics. 2010.

Reprints are welcome; please credit the source: this article is from Bin's column, blog.csdn.net/xbinworld. Technical discussion QQ group: 433250724. Students interested in algorithms and technology are welcome to join.

from:http://blog.csdn.net/VictoriaW/article/details/72872036

