Parameter initialization is a very important aspect of neural networks and of deep learning algorithms in general. The traditional approach draws the initial parameters randomly from a Gaussian distribution, or even sets them all directly to 1 or 0. This is simple and blunt, but the results are often mediocre. This article is based on an overseas discussion thread [1], elaborated with our own understanding.

First, consider why the choice of initial parameters matters so much in neural networks (to keep the problem simple, we consider the most basic DNN). Take the sigmoid function (logistic neurons) as an example. As the absolute value of x grows, the function becomes flatter and saturates, and its derivative tends to 0. For example, at x=2 the derivative is about 1/10, while at x=10 it is only about 1/22000. In other words, when the input to the activation function is 10, the network learns roughly 2200 times more slowly than when the input is 2.
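
The derivative values quoted above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not taken from the original discussion.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


g2 = sigmoid_grad(2.0)    # roughly 0.105, i.e. about 1/10
g10 = sigmoid_grad(10.0)  # roughly 4.5e-5, i.e. about 1/22000
ratio = g2 / g10          # a little over 2000

print(f"sigmoid'(2)  = {g2:.6f}")
print(f"sigmoid'(10) = {g10:.3e}")
print(f"ratio        = {ratio:.0f}")
```

The ratio of the two derivatives comes out a little above 2000, consistent with the rounded figure in the text.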

To make the neural network learn faster, we want the derivative of the sigmoid activation function to be large. Numerically, this means keeping the input of the sigmoid roughly within [-4,4] (see the figure above), although it does not have to be that precise. The input of a neuron j is the weighted sum of the outputs of the previous layer of neurons, $x_j = \sum_i a_i \cdot w_i + b_j$. We can therefore control the range of the initial weights so that the inputs of the neurons fall within the range we need.

A relatively simple and effective method is to initialize each weight uniformly at random from the interval

$\left(-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\right)$, where $d$ is the number of inputs to the neuron.
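
As a rough illustration, sampling each weight uniformly from (-1/sqrt(d), 1/sqrt(d)) might look like the sketch below. The function name `init_uniform_fan_in`, the layer sizes, and the (inputs, outputs) layout of the weight matrix are assumptions made for the example.

```python
import numpy as np


def init_uniform_fan_in(d: int, n_out: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a (d, n_out) weight matrix uniformly from (-1/sqrt(d), 1/sqrt(d)).

    d is the number of inputs feeding each neuron (the fan-in).
    """
    limit = 1.0 / np.sqrt(d)
    return rng.uniform(-limit, limit, size=(d, n_out))


rng = np.random.default_rng(0)          # fixed seed only for reproducibility
W = init_uniform_fan_in(d=256, n_out=128, rng=rng)

print(W.shape)                           # (256, 128)
print(np.abs(W).max() <= 1.0 / 16.0)     # every weight lies within 1/sqrt(256)
```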

To show that this choice is reasonable, let us briefly review some basics:

1. For a random variable following the uniform distribution $U(a,b)$, the expectation is $E(X) = (a+b)/2$ and the variance is $\mathrm{Var}(X) = (b-a)^2/12$.

2. If the random variables X and Y are independent, then $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$. If X and Y are independent and both have mean 0, then $\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y)$.

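Therefore, if we assume that the inputs $x_i$ of a neuron have mean 0 and standard deviation 1, then

$$\mathrm{Var}(w_i) = \frac{(2/\sqrt{d})^2}{12} = \frac{1}{3d}, \qquad \mathrm{Var}\left(\sum_{i=1}^{d} w_i x_i\right) = \sum_{i=1}^{d} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i) = d \cdot \frac{1}{3d} = \frac{1}{3}$$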

In other words, the weighted sum of d random input signals, with weights drawn uniformly from $\left(-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\right)$, has mean 0 and variance 1/3 and is approximately normally distributed, independent of d. Since the standard deviation is about 0.58, the interval [-4,4] spans nearly seven standard deviations, so the probability that the input of the neuron falls outside it is very small.
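
The claim that the variance stays near 1/3 regardless of d can be checked empirically. This is a small simulation sketch; drawing the inputs from a standard normal distribution is an assumption made here (the argument only requires mean 0 and standard deviation 1), and `preactivation_variance` is a hypothetical helper name.

```python
import numpy as np


def preactivation_variance(d: int, n_trials: int, rng: np.random.Generator) -> float:
    """Estimate Var(sum_i w_i * x_i) with w_i ~ U(-1/sqrt(d), 1/sqrt(d)).

    Inputs x_i are sampled from N(0, 1) here, which is an assumption:
    any distribution with mean 0 and standard deviation 1 would do.
    """
    limit = 1.0 / np.sqrt(d)
    w = rng.uniform(-limit, limit, size=(n_trials, d))
    x = rng.standard_normal(size=(n_trials, d))
    z = np.sum(w * x, axis=1)            # one weighted sum per trial
    return float(np.var(z))


rng = np.random.default_rng(0)
variances = {d: preactivation_variance(d, n_trials=4000, rng=rng)
             for d in (10, 100, 1000)}

for d, v in variances.items():
    print(f"d={d:5d}  estimated variance = {v:.3f}  (theory: {1/3:.3f})")
```

Up to sampling noise, the estimates should stay close to 1/3 for every d.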

A more general form can be written as :


Another, more recent initialization method

According to Glorot & Bengio (2010) [4], initialize the weights uniformly within the interval $[-b, b]$, where, for sigmoid units,

$$b = 4\sqrt{\frac{6}{H_k + H_{k+1}}}$$

and $H_k$ and $H_{k+1}$ are the sizes of the layers before and after the weight matrix. For hyperbolic tangent units, sample from a Uniform $[-b, b]$ with

$$b = \sqrt{\frac{6}{H_k + H_{k+1}}}$$


Initialization methods for other scenarios [2]

  • in the case of RBMs, a zero-mean Gaussian with a small standard deviation around 0.1 or 0.01 works well (Hinton, 2010) to initialize the weights.

  • Orthogonal random matrix initialization, i.e. W = np.random.randn(ndim, ndim); u, s, v = np.linalg.svd(W) then use u as your initialization matrix.
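
The orthogonal initialization in the last item can be written out as a short runnable sketch. It follows the SVD recipe quoted above; the wrapper name `orthogonal_init` and the use of `np.random.default_rng` in place of `np.random.randn` are choices made for this example.

```python
import numpy as np


def orthogonal_init(ndim: int, rng: np.random.Generator) -> np.ndarray:
    """Return an (ndim, ndim) orthogonal matrix from the SVD of a Gaussian matrix."""
    W = rng.standard_normal((ndim, ndim))
    u, s, vt = np.linalg.svd(W)   # u and vt are orthogonal, s holds singular values
    return u


rng = np.random.default_rng(0)
Q = orthogonal_init(64, rng)

# Orthogonality check: Q^T Q should equal the identity up to rounding error.
print(np.allclose(Q.T @ Q, np.eye(64)))
```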

