This post is part of my participation in the August More-Text Challenge (8月更文挑战); see the event page for details.
1. Do you understand the common loss functions, the common activation functions, and the ELU function?
Common loss functions: the 0-1 loss, absolute loss, log (logarithmic) loss, squared loss, exponential loss, hinge loss, cross-entropy loss, and so on.
- 0-1 loss: $L(y, f(x)) = \mathbb{1}[y \neq f(x)]$
- Absolute loss: $L(y, f(x)) = |y - f(x)|$
- Log (logarithmic) loss: $L(y, P(y \mid x)) = -\log P(y \mid x)$
- Squared loss: $L(y, f(x)) = (y - f(x))^2$
- Exponential loss: $L(y, f(x)) = \exp(-y f(x))$
- Hinge loss: $L(y, f(x)) = \max(0,\, 1 - y f(x))$
- Cross-entropy loss: $L(y, \hat{y}) = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$
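As a minimal sketch (the function names and label conventions below are my own: $y \in \{-1, +1\}$ for the margin-based losses, $y \in \{0, 1\}$ for cross-entropy), these losses can be written directly in Python:

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # 1 when the prediction is wrong, 0 otherwise
    return float(y != y_pred)

def absolute_loss(y, f):
    return abs(y - f)

def squared_loss(y, f):
    return (y - f) ** 2

def exponential_loss(y, f):
    # y in {-1, +1}; f is a real-valued score
    return np.exp(-y * f)

def hinge_loss(y, f):
    # y in {-1, +1}; zero once the margin y*f exceeds 1
    return max(0.0, 1.0 - y * f)

def cross_entropy_loss(y, p):
    # y in {0, 1}; p is the predicted probability of class 1
    eps = 1e-12  # clip to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```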
Common activation functions include: Sigmoid, Tanh, ReLU, and Leaky ReLU.
Sigmoid function:

It squashes a continuous real-valued input into an output between 0 and 1: $\sigma(x) = \frac{1}{1 + e^{-x}}$. In particular, a very large negative input gives an output close to 0, and a very large positive input gives an output close to 1.

Drawback 1: in deep neural networks, gradients tend to vanish during backpropagation. The probability of gradient explosion is small, while the probability of vanishing gradients is comparatively large.

Drawback 2: the output of Sigmoid is not zero-centered.

Drawback 3: its analytic form contains an exponential, which is relatively expensive to compute. For large-scale deep networks this significantly increases training time.
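A small numeric sketch (my own illustration) of drawback 1: the sigmoid's derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ peaks at 0.25 and collapses toward 0 in both tails, which is exactly why stacked sigmoid layers shrink gradients during backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 2001)
g = sigmoid_grad(x)
# the gradient peaks at 0.25 (at x = 0) and saturates toward 0 in both tails
print(g.max())
print(sigmoid_grad(10.0))
```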
Tanh function:

Characteristics: it solves Sigmoid's non-zero-centered output problem, and it converges faster than sigmoid. However, the vanishing-gradient problem and the cost of the exponential still remain.
ReLU function:

1. ReLU produces its output by thresholding, so its computational cost is lower than that of the other two functions (both of which involve exponentials).
2. The non-saturating region of ReLU effectively alleviates the vanishing-gradient problem and provides a relatively wide activation boundary.
3. ReLU's one-sided suppression gives the network a sparse representation.

Limitation of ReLU: it can cause neurons to "die" during training.

Because $f(x) = \max(0, x)$, once a ReLU unit receives a negative pre-activation its output is set to 0, and if no later input ever activates it again, the gradient flowing through that neuron is always 0 and it stops responding to any data. In practice, if the learning rate is set too large, more than an acceptable fraction of neurons may die irreversibly; their parameter gradients can no longer be updated, and the whole training run can fail.
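A minimal sketch (my own illustration) of the dying-ReLU effect described above: once a unit's pre-activations are all negative, both its output and its (sub)gradient are identically zero, so no weight update can ever revive it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient of max(0, x): 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

# a neuron whose pre-activation went negative for every input in the batch
pre_activations = np.array([-3.2, -0.7, -1.5, -8.0])
print(relu(pre_activations))       # all zeros: the neuron never fires
print(relu_grad(pre_activations))  # all zeros: no gradient flows back,
                                   # so its weights can never be updated
```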
Leaky ReLU function:

Leaky ReLU (LReLU) differs from ReLU in that for $z < 0$ its value is not 0 but a linear function with slope $a$, where $a$ is usually a small positive constant. This preserves one-sided suppression while retaining some of the negative-side gradient information rather than discarding it entirely. On the other hand, choosing $a$ adds difficulty to the problem: a strong hand-crafted prior or repeated training is needed to find an appropriate value.

For this reason, the parameterized PReLU (Parametric ReLU) emerged. Its main difference from LReLU is that the slope $a$ of the negative half-axis is a learnable network parameter, trained by backpropagation and optimized jointly with the other parameterized layers. Another LReLU variant adds a "randomization" mechanism: during training, the slope $a$ is drawn at random from a given distribution, then fixed at test time. This Randomized ReLU (RReLU) can, to some extent, act as a regularizer.
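A hedged sketch of the two variants above (the function names are my own; a real PReLU would update `a` inside a training loop): Leaky ReLU with a fixed slope, plus the gradient of the PReLU output with respect to its learnable slope.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # a is a small fixed positive slope for the negative half-axis
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    # PReLU uses the same formula but treats `a` as learnable;
    # d(output)/da is x on the negative half-axis, 0 elsewhere
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))    # negative inputs are scaled, not zeroed
print(prelu_grad_a(x))  # the gradient signal that trains the slope a
```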
ELU function:

ELU is an improved form of ReLU. Compared with ReLU, it produces a non-zero output for negative inputs, and that part of the output has some robustness to noise. This eliminates the dying-ReLU problem, but gradient saturation (for large negative inputs) and the exponential computation remain.
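A minimal sketch of ELU (my own illustration): the identity for positive inputs, and a smooth exponential curve bounded below by $-\alpha$ for negative ones, so negative inputs still produce output and gradient.

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for x > 0; alpha * (exp(x) - 1), which saturates
    # toward -alpha, for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))  # negative inputs saturate toward -alpha instead of 0
```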
2. Why do we use cross-entropy rather than MSE (for logistic regression)?
The basic expression of logistic regression (LR) is:

$h_\theta(x) = \sigma(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$

First, using cross-entropy as the loss function, the loss is:

$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[y_i \log h_\theta(x_i) + (1 - y_i)\log\left(1 - h_\theta(x_i)\right)\right]$

and the derivative used in the gradient-descent update is:

$\dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) x_{ij}$

If we instead use MSE as the loss function, the loss and its derivative are:

$J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2, \qquad \dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) h_\theta(x_i)\left(1 - h_\theta(x_i)\right) x_{ij}$
With the squared loss, you will find that the speed of the gradient update is strongly tied to the gradient of the sigmoid function itself. The sigmoid's derivative is at most 0.25 over its entire domain, so training becomes very slow. With cross-entropy this does not happen: its derivative is just a difference, so a large error yields a fast update and a small error a slow one, which is exactly what we want.
When the Sigmoid function is used as the probability of the positive class and the squared loss is taken as the loss function, the resulting loss is non-convex, hard to optimize, and prone to local optima. If maximum likelihood is used instead, the objective is the log-likelihood, and the loss is a higher-order continuously differentiable convex function of the unknown parameters, which makes it convenient to find the global optimum. (On checking convexity: by definition, a univariate function is convex if its second derivative is non-negative everywhere; for a multivariate function, one checks the Hessian matrix, i.e. the square matrix of second partial derivatives, and the function is convex if the Hessian is positive semidefinite.)
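A small numeric sketch (my own illustration, using the derivatives above, where $z = \theta^T x$): for a confidently wrong prediction, the cross-entropy gradient $\hat{y} - y$ stays large, while the squared-loss gradient $(\hat{y} - y)\,\hat{y}(1 - \hat{y})$ is crushed by the saturated sigmoid, so learning stalls.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a confidently *wrong* prediction: true label 1, pre-activation z = -8
y, z = 1.0, -8.0
p = sigmoid(z)                     # ~0.000335, far from the true label

grad_ce  = p - y                   # cross-entropy: near -1, fast update
grad_mse = (p - y) * p * (1 - p)   # squared loss: tiny, training stalls

print(grad_ce, grad_mse)
```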
3. The F1-score calculation formula.
To calculate the F1-score, first compute Precision and Recall:

$\text{Precision} = \dfrac{TP}{TP + FP}, \qquad \text{Recall} = \dfrac{TP}{TP + FN}$

The F1-score is their harmonic mean:

$F1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
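The formulas above translate into a few lines of Python (a sketch with my own function name, taking raw confusion-matrix counts):

```python
def f1_score(tp, fp, fn):
    """F1 from confusion-matrix counts: the harmonic mean of
    precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall 2/3
```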
4. A comparison of FM vs. SVM.
The biggest difference between FM and SVM lies in how the weights of feature interactions are computed:

- In SVM (with a polynomial kernel), the pairwise feature-interaction parameters are independent of each other; in FM, the interaction parameter for features i and j is the inner product of two k-dimensional vectors $v_i$ and $v_j$, so the interaction parameters are not independent but influence one another.
- FM can be optimized and learned in its primal form, whereas a kernel-based nonlinear SVM usually has to be solved in the dual form.
- FM's model predictions are independent of the training samples, while an SVM's predictions depend on part of the training samples (the support vectors).
The FM model has two advantages: it can still estimate feature interactions under extreme sparsity, and it even generalizes to interaction parameters never observed together in training; moreover, the time complexity of model prediction is linear.
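The linear prediction cost comes from FM's reformulation of the pairwise interaction term, $\sum_{i<j}\langle v_i, v_j\rangle x_i x_j = \frac{1}{2}\sum_f\big[(\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]$, which drops the cost from $O(kn^2)$ to $O(kn)$. A sketch with random toy data (the variable names are my own) verifying the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                  # n features, k-dimensional factor vectors
x = rng.normal(size=n)       # a dense feature vector (sparse in practice)
V = rng.normal(size=(n, k))  # row i is the factor vector v_i

# naive O(k * n^2): sum over all feature pairs of <v_i, v_j> * x_i * x_j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# FM's reformulation, O(k * n):
# 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
fast = 0.5 * (((V.T @ x) ** 2).sum() - ((V ** 2).T @ (x ** 2)).sum())

print(naive, fast)  # identical up to floating-point error
```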
5. The randomness in random forests.
The randomness of a random forest shows up in two places: the training samples of each tree are drawn at random (a bootstrap sample), and the candidate feature set for each node split is also chosen at random. With these two sources of randomness, a random forest is far less prone to overfitting.
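A minimal sketch (my own illustration, not a full forest implementation) of the two sources of randomness: sampling with replacement for each tree, and sampling a feature subset for each split.

```python
import random

def bootstrap_sample(n_samples, rng):
    # randomness 1: each tree trains on a bootstrap sample
    # (drawn with replacement) of the training set
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, max_features, rng):
    # randomness 2: each node split considers only a random
    # subset of the features
    return rng.sample(range(n_features), max_features)

rng = random.Random(42)
print(bootstrap_sample(10, rng))          # indices may repeat
print(random_feature_subset(16, 4, rng))  # e.g. sqrt(16) = 4 features
```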
Approach: greedy algorithm.

Traverse each position of the array in turn while maintaining the farthest reachable position rightMost. Note that the current position i must itself be reachable (i.e., i <= rightMost) before it can extend rightMost. Once rightMost is greater than or equal to the last index of the array, the end is reachable, so return True; otherwise, after the traversal, return False.
The code is as follows:

```python
from typing import List

class Solution:
    def canJump(self, nums: List[int]) -> bool:
        n = len(nums)
        rightMost = 0  # farthest index reachable so far
        for i in range(n):
            if i <= rightMost:  # position i is itself reachable
                rightMost = max(rightMost, i + nums[i])
                if rightMost >= n - 1:
                    return True
        return False
```
Time complexity: O(n), where n is the size of the array. The nums array is traversed only once, visiting n positions in total.

Space complexity: O(1); no extra space is needed.