Previously in the Spark recommendation-in-practice series:

Spark recommendation in practice: the Swing algorithm and its application at Alibaba Fliggy
Spark recommendation in practice: an analysis of the ALS algorithm implementation
How Spark implements i2i indirectly with matrix operations
FPGrowth: algorithm principle, Spark implementation and applications
Spark recommendation series: the Word2vec algorithm, implementation and applications
Spark recommendation in practice: KMeans and fallback recall for cold start
Spark recommendation in practice: two implementations of LR and an introduction to multi-class LR
For more, follow and star "Search and recommend Wiki".
This article mainly covers the following contents:
Regression analysis
What is regression analysis
Classification of regression analysis algorithms
Introduction to logistic regression
The Sigmoid function
Why LR uses the Sigmoid function
The algorithm principle of LR
LRWithLBFGS in mllib
Binary LR in ml
Multi-class LR in ml
Logistic Regression (LR) was applied to recommendation ranking early on. It is a linear model: simple, able to take in massive numbers of discrete features, and its advantage is that the model can account for fine-grained or individual factors. To introduce non-linearity you have to do feature crossing, which easily produces tens of billions of features; in the early days of CTR prediction, effectiveness was improved mainly through labor-intensive feature engineering.
Although few industrial scenarios or businesses still use LR alone for ranking, for beginners, and for small and medium-sized companies short on resources and manpower, LR is still a good choice. It lets you follow the strategy of shipping a ranking model first and iterating on it later.
Before learning LR, let's first understand what regression analysis is and how it is classified.
Regression analysis
Regression analysis (Regression Analysis) algorithms are among the most common machine learning algorithms.
What is regression analysis
Regression analysis uses samples (known data) to produce a fitted equation and thereby make predictions (for unknown data). For example, given a random variable y and another random variable x, the statistical method that studies the relationship between y and x is called regression analysis. When y corresponds to a single variable x, this is univariate linear regression.
Classification of regression analysis algorithms
Regression analysis algorithms are divided into linear regression algorithms and nonlinear regression algorithms.
1. Linear regression
Linear regression divides into univariate and multivariate linear regression. In linear regression, the exponents of the independent variables are all 1. "Linear" here does not literally mean fitting the data with a straight line; in two dimensions the fit can be a plane, in three dimensions a surface, and so on.
Univariate linear regression: regression with only one independent variable. For example, the relationship between house area (Area) and total house price (Money): as the area grows, the price grows too. The only independent variable is the area, so this is univariate linear regression.
Multivariate linear regression: regression with two or more independent variables. For example, the relationship between house area (Area), floor (floor) and house price (Money): there are two independent variables here, so this is bivariate linear regression.
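As a quick sketch of univariate linear regression (the area and price figures below are made up for illustration), ordinary least squares can be fitted in a few lines:

```python
import numpy as np

# Toy data: house area (m^2) vs. total price, roughly price = 1.2 * area + 10
area = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = 1.2 * area + 10 + np.array([1.0, -2.0, 0.5, 1.5, -1.0])  # small noise

# Fit price = w * area + b by ordinary least squares (degree-1 polynomial fit)
w, b = np.polyfit(area, price, deg=1)
print(round(w, 2), round(b, 2))  # close to the true slope 1.2 and intercept 10
```

Multivariate linear regression works the same way, with one coefficient per independent variable (e.g. via np.linalg.lstsq on a feature matrix).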
Typical linear regression methods include ordinary least squares and regularized variants such as Ridge regression and Lasso regression.
In the statistical sense, a regression equation is linear if it is linear in its parameters. As long as the model is linear in the parameters, it is still linear regression even when a feature is a square or higher power of a sample variable, for example:

y = w0 + w1·x + w2·x²

You can even build features from logarithms or exponentials, for example:

y = w0 + w1·ln(x)
2. Nonlinear regression
In some models the regression parameters are not linear and cannot be transformed into linear parameters; such models are called nonlinear regression models. Nonlinear regression likewise divides into univariate and multivariate regression, and in nonlinear regression at least one independent variable has an exponent other than 1. In regression analysis, when the causal relationship under study involves only one dependent variable and one independent variable, it is called univariate regression analysis; when it involves the dependent variable and two or more independent variables, it is called multivariate regression analysis.
For example, the following two regression equations:

y = θ1·e^(θ2·x)
y = θ1·x / (θ2 + x)

Unlike in a linear regression model, a feature in these nonlinear regression models is tied to more than one parameter.
3. Generalized linear regression
Some nonlinear regressions can still be analyzed with linear regression; such nonlinear regressions are called generalized linear regression, and the typical representative is Logistic Regression.
We won't go into more detail here; more on this below.
Introduction to logistic regression
Logistic regression is essentially the same as linear regression: both solve for the optimal coefficients through an error function; formally, logistic regression just applies a logistic function on top of linear regression. Compared with linear regression, Logistic Regression (LR) is better suited to models whose dependent variable is binary, and the logistic regression coefficients can be used to estimate the odds ratio of each independent variable in the model.
The Sigmoid function
For binary classification we want an output of 0 or 1, as with the Heaviside step function; the Sigmoid function behaves similarly but is smooth. Its mathematical expression is:

f(x) = 1 / (1 + e^(-x))

The following code plots the Sigmoid function.
import math
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# In Python 2 range returns a list; in Python 3 it returns an iterator, so convert with list
X = list(range(-10, 10))
Y = list(map(sigmoid, X))

fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111)
# Hide the top and right spines
ax.spines["top"].set_color("none")
ax.spines["right"].set_color("none")
# Move the other two axes so they cross at (0, 0.5)
ax.yaxis.set_ticks_position("left")
ax.spines["left"].set_position(("data", 0))
ax.xaxis.set_ticks_position("bottom")
ax.spines["bottom"].set_position(("data", 0.5))
ax.plot(X, Y)
plt.show()
As the plot shows, the Sigmoid function is continuous, smooth, strictly monotonic and centrally symmetric about (0, 0.5): a very good threshold function. As x approaches negative infinity, f(x) tends to 0; as x approaches positive infinity, f(x) tends to 1; at x = 0, f(x) = 0.5. Beyond the range [-6, 6] the function value barely changes and stays very close to 0 or 1, so in applications values outside this range are generally not considered.
The value range of the Sigmoid function is (0, 1), which matches the range of probability values [0, 1], so the Sigmoid output can be linked to a probability distribution.
The derivative of the Sigmoid function can be written in terms of the function itself, f'(x) = f(x)·(1 - f(x)), which makes computation convenient and fast. The derivation is:

f'(x) = d/dx [ (1 + e^(-x))^(-1) ] = e^(-x) / (1 + e^(-x))² = f(x)·(1 - f(x))
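The derivative identity can be checked numerically (a standalone sketch using central differences):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: f'(x) = f(x) * (1 - f(x))
    return sigmoid(x) * (1.0 - sigmoid(x))

def numeric_grad(f, x, h=1e-6):
    # Central-difference approximation of the derivative
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The two agree at any point, confirming f'(x) = f(x)(1 - f(x))
for x in (-2.0, 0.0, 1.5):
    print(x, round(sigmoid_grad(x), 6), round(numeric_grad(sigmoid, x), 6))
```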
Why LR uses the Sigmoid function
Only the binary case is discussed here. LR makes a single assumption: the features of the two classes follow Gaussian distributions with unequal means and equal variances, that is:

p(x|y=0) ~ N(μ0, σ²),  p(x|y=1) ~ N(μ1, σ²)

Why assume Gaussian distributions? On one hand, the Gaussian is easy to work with; on the other, from an information-theoretic point of view, when the mean and variance are known (we don't know the exact mean and variance, but by probability theory, as the sample size grows the sample mean and variance converge in probability to the true mean and variance), the Gaussian is the distribution with the largest entropy. Why maximum entropy? Because the maximum-entropy distribution spreads the risk. Think of binary search: why pick the midpoint as the probe every time? Precisely to spread the risk. The equal-variance assumption is for convenience in the later derivation: if the variances were unequal, they would not cancel out.
First define the "risk" as:

R(y=0|x) = λ00·P(y=0|x) + λ01·P(y=1|x)
R(y=1|x) = λ10·P(y=0|x) + λ11·P(y=1|x)

Here R(y=0|x) is the risk of predicting the sample as 0, R(y=1|x) is the risk of predicting it as 1, and λij is the risk of predicting as class i a sample whose actual label is class j.
In the LR algorithm a correct prediction is considered to carry no risk, so λ00 and λ11 are both 0. Moreover, predicting 1 when the label is 0 and predicting 0 when the label is 1 are considered equally risky, so λ01 and λ10 are equal; for convenience, write both as λ.
The "risk" defined above then simplifies to:

R(y=0|x) = λ·P(y=1|x)
R(y=1|x) = λ·P(y=0|x)
Now the question: for a given sample, is it better to predict 0 or 1? By the principle of risk minimization, choose the prediction with the smaller risk: when R(y=0|x) < R(y=1|x), i.e. when λ·P(y=1|x) < λ·P(y=0|x), predicting 0 carries less risk than predicting 1, and the sample should be predicted as 0. In other words: compare the two conditional probabilities and assign the sample to the class with the larger probability.
Dividing both sides of the inequality by P(y=0|x) gives:

P(y=1|x) / P(y=0|x) < 1

Take the logarithm of the left side of the inequality (why the logarithm? As mentioned above, the class-conditional features follow Gaussians with unequal means and equal variances, and the logarithm is convenient for handling the exponent in the Gaussian density), then expand with Bayes' formula and ignore the normalization constant, which gives:

ln[ P(y=1|x) / P(y=0|x) ] = ln[ p(x|y=1) / p(x|y=0) ] + ln[ P(y=1) / P(y=0) ]

For convenience assume x is one-dimensional (this extends easily to the multi-dimensional case). Substitute the Gaussian density; since σ, μ0 and μ1 are all constants, abbreviate the prior term as a constant C and keep expanding, which gives:

ln[ P(y=1|x) / P(y=0|x) ] = (μ1 - μ0)/σ² · x + (μ0² - μ1²)/(2σ²) + C

Let:

θ = (μ1 - μ0)/σ²,  b = (μ0² - μ1²)/(2σ²) + C

Exponentiate both sides, and simplify using the probability identity P(y=0|x) = 1 - P(y=1|x):

P(y=1|x) = 1 / (1 + e^(-(θx + b)))

The calculation goes:

P(y=1|x) / (1 - P(y=1|x)) = e^(θx + b)  ⟹  P(y=1|x) = e^(θx + b) / (1 + e^(θx + b)) = 1 / (1 + e^(-(θx + b)))

In summary, this is why the LR algorithm uses the Sigmoid function.
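To make the derivation concrete, here is a small numeric check (μ0, μ1 and σ are made-up values, and equal class priors are assumed): under the equal-variance Gaussian assumption the log-odds is exactly linear in x, and the posterior is a Sigmoid of that linear function.

```python
import math

mu0, mu1, sigma = 0.0, 2.0, 1.0   # made-up class means, shared standard deviation

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_odds(x):
    # With equal priors: ln P(y=1|x)/P(y=0|x) = ln p(x|y=1)/p(x|y=0)
    return math.log(gauss_pdf(x, mu1, sigma) / gauss_pdf(x, mu0, sigma))

# The log-odds is linear in x: with these parameters it works out to 2*x - 2
print([round(log_odds(x), 6) for x in (0.0, 1.0, 2.0)])  # → [-2.0, 0.0, 2.0]

# And the posterior is exactly a Sigmoid of that linear function
posterior = 1.0 / (1.0 + math.exp(-log_odds(1.5)))
print(round(posterior, 4))  # sigmoid(2*1.5 - 2) = sigmoid(1)
```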
The algorithm principle of LR
1. Algorithm principle
A machine learning model essentially restricts the decision function to a particular set of conditions, and this set of constraints determines the model's hypothesis space. Naturally, we also want this set of constraints to be simple and reasonable.
The hypothesis made by the logistic regression model is:

hθ(x) = P(y=1|x; θ) = g(θᵀx)

where g is the Sigmoid function, g(z) = 1 / (1 + e^(-z)). The corresponding decision function is:

ŷ = 1 if hθ(x) ≥ 0.5, otherwise ŷ = 0

Choosing 0.5 as the threshold is the usual practice; in real applications a different threshold may be chosen for the specific case: if precision on positive cases matters more, raise the threshold; if recall on positive cases matters more, lower it.
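A toy illustration of the threshold trade-off (the scores and labels below are made up): raising the threshold buys precision at the cost of recall, and vice versa.

```python
# Made-up predicted probabilities and true labels for eight samples
probs = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

def precision_recall(threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(0.7))  # high threshold: precision 1.0, recall 0.5
print(precision_recall(0.3))  # low threshold: recall 1.0, lower precision
```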
Once the mathematical form of the function is fixed, the next step is to solve for the model parameters. A commonly used statistical method is maximum likelihood estimation: find the set of parameters under which the likelihood (probability) of the observed data is largest. In the logistic regression algorithm, the likelihood function can be written as:

L(θ) = ∏ᵢ hθ(xᵢ)^yᵢ · (1 - hθ(xᵢ))^(1-yᵢ)

Taking the logarithm gives the likelihood function in logarithmic form:

ℓ(θ) = Σᵢ [ yᵢ·ln hθ(xᵢ) + (1-yᵢ)·ln(1 - hθ(xᵢ)) ]

A loss function likewise measures how accurate the model's predictions are; here we use the log loss, defined on a single example as:

cost(hθ(x), y) = -y·ln(hθ(x)) - (1-y)·ln(1 - hθ(x))

Averaging over the whole dataset gives:

J(θ) = -(1/m)·Σᵢ [ yᵢ·ln hθ(xᵢ) + (1-yᵢ)·ln(1 - hθ(xᵢ)) ]

In the logistic regression model, maximizing the likelihood and minimizing the log loss are in fact equivalent. There are many ways to solve this optimization problem; here we take gradient descent as the example. The basic steps are:

(1) Choose the descent direction (the negative gradient direction, -∇J(θ));

(2) Choose a step size α and update the parameters: θ ← θ - α·∇J(θ);

(3) Repeat the two steps above until the termination condition is met.

The gradient of the loss function is computed as:

∂J(θ)/∂θⱼ = (1/m)·Σᵢ ( hθ(xᵢ) - yᵢ )·xᵢⱼ
Taking a sufficiently small step along the negative gradient direction guarantees that the loss decreases. In addition, the loss function of the logistic regression model is convex (strictly convex with a regularization term), which guarantees that the local optimum found is also the global optimum.
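The three steps above can be sketched end-to-end in plain NumPy (a toy sketch: the data is synthetic, and the fixed step size and iteration count are arbitrary choices, not a tuned recipe):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic binary data generated from true weights [2, -1] and bias 0.5
X = rng.normal(size=(500, 2))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([2.0, -1.0]) + 0.5)))
y = (rng.random(500) < p_true).astype(float)

w, b = np.zeros(2), 0.0
alpha = 0.5  # step size
for _ in range(300):
    pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid(theta^T x)
    grad_w = X.T @ (pred - y) / len(y)         # gradient of the average log loss
    grad_b = np.mean(pred - y)
    w -= alpha * grad_w                        # step along the negative gradient
    b -= alpha * grad_b

# Training accuracy with the 0.5 threshold (margin > 0)
acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(w.round(2), round(b, 2), acc)
```

The learned weights land close to the generating weights, and the loss decreases monotonically for a small enough step, as the convexity argument above predicts.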
2. Regularization
When the model has too many parameters it is prone to overfitting. We then need to control model complexity; the most common way is to add a regularization term to the objective, preventing overfitting by penalizing overly large parameters.
Common regularization methods are L1 regularization and L2 regularization, corresponding to the following two objectives:

J(θ) + λ·||w||₁   and   J(θ) + λ·||w||₂²

L1 regularization uses the sum of the absolute values of the elements of the weight vector w, usually written ||w||₁.
L2 regularization uses the square root of the sum of squares of the elements of the weight vector w (note that the L2 term in Ridge regression carries a square), usually written ||w||₂.
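The two penalty terms can be compared directly (the weight vector and λ below are made-up example values; the L2 term uses the common λ/2 · ||w||₂² form):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])  # example weight vector
lam = 0.1                            # regularization strength (made up)

l1 = lam * np.abs(w).sum()           # L1 penalty: lambda * ||w||_1
l2 = 0.5 * lam * (w ** 2).sum()      # L2 penalty: (lambda/2) * ||w||_2^2

print(l1, l2)
```

Note that the L1 penalty grows linearly in each weight while the L2 penalty grows quadratically, which is why L1 tends to drive small weights exactly to zero while L2 merely shrinks them.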
LRWithLBFGS in mllib
The Spark.mllib package provides two LR classification models:

mini-batch gradient descent (LogisticRegressionWithSGD)
L-BFGS (LogisticRegressionWithLBFGS)

The official advice is to prefer L-BFGS, because LR based on L-BFGS converges faster than the version based on SGD. In the official words:
We implemented two algorithms to solve logistic regression: minibatch gradient descent and LBFGS. We recommend LBFGS over minibatch gradient descent for faster convergence.
Moreover, LRWithLBFGS supports both binary and multi-class classification, while LRWithSGD supports only binary classification. So we only cover the operations of LogisticRegressionWithLBFGS in Spark mllib here.
Set variables and create the SparkSession
val file = "data/sample_libsvm_data.txt"
val model_path = "model/lr/"
val model_param = "numInterations:5,regParam:0.1,updater:SquaredL2Updater,gradient:LogisticGradient"
val spark = SparkSession.builder()
.master("local[5]")
.appName("LogisticRegression_Model_Train")
.getOrCreate()
Logger.getRootLogger.setLevel(Level.WARN)
Split the dataset
// Load the dataset and split it into training and test sets (70/30)
val data = MLUtils.loadLibSVMFile(spark.sparkContext,file).randomSplit(Array(0.7,0.3))
val (train, test) = (data(0), data(1))
LRWithLBFGS model parameters
// Number of classes; default 2 (LogisticRegression parameter)
private var numClass: Int = 2
// Whether to add an intercept; default false (LogisticRegression parameter)
private var isAddIntercept: Option[Boolean] = None
// Whether to validate data before training (LogisticRegression parameter)
private var isValidateData: Option[Boolean] = None
// Number of iterations; default 100 (LBFGS parameter)
private var numInterations: Option[Int] = None
// Regularization coefficient; default 0.0 (LBFGS parameter)
private var regParam: Option[Double] = None
// Regularizer; supports L1Updater [L1], SquaredL2Updater [L2], SimpleUpdater [no regularization] (LBFGS parameter)
private var updater: Option[String] = None
// Gradient; supports LogisticGradient, LeastSquaresGradient, HingeGradient (LBFGS parameter)
private var gradient: Option[String] = None
// User-defined convergence threshold
private var threshold: Option[Double] = None
// Model convergence tolerance; default 1e-6
private var convergenceTol: Double = 1.0e-6
Create the model
def createLRModel(model_param: String): LogisticRegressionWithLBFGS = {
// Set model parameters
val optimizer = new LROptimizer()
optimizer.parseString(model_param)
println(s"Model training parameters: ${optimizer.toString}")
// Create the model and specify relevant parameters
val LRModel = new LogisticRegressionWithLBFGS()
// Set the number of categories
LRModel.setNumClasses(optimizer.getNumClass)
// Set whether to add intercept
if(optimizer.getIsAddIntercept.nonEmpty) {LRModel.setIntercept(optimizer.getIsAddIntercept.get)}
// Set whether to validate the model
if(optimizer.getIsValidateData.nonEmpty){LRModel.setValidateData(optimizer.getIsValidateData.get)}
// Set the number of iterations
if(optimizer.getNumInterations.nonEmpty){LRModel.optimizer.setNumIterations((optimizer.getNumInterations.get))}
// Set the regularization coefficient
if(optimizer.getRegParam.nonEmpty) { LRModel.optimizer.setRegParam(optimizer.getRegParam.get) }
// Set the regularizer (updater)
if(optimizer.getUpdater.nonEmpty){
optimizer.getUpdater match {
case Some("L1Updater") => LRModel.optimizer.setUpdater( new L1Updater())
case Some("SquaredL2Updater") => LRModel.optimizer.setUpdater(new SquaredL2Updater())
case Some("SimpleUpdater") => LRModel.optimizer.setUpdater(new SimpleUpdater())
case _ => LRModel.optimizer.setUpdater(new SquaredL2Updater())
}
}
// Set the gradient calculation method
if(optimizer.getGradient.nonEmpty){
optimizer.getGradient match {
case Some("LogisticGradient") => LRModel.optimizer.setGradient(new LogisticGradient())
case Some("LeastSquaresGradient") => LRModel.optimizer.setGradient(new LeastSquaresGradient())
case Some("HingeGradient") => LRModel.optimizer.setGradient(new HingeGradient())
case _ => LRModel.optimizer.setGradient(new LogisticGradient())
}
}
// Set the convergence threshold
if(optimizer.getThreshold.nonEmpty){ LRModel.optimizer.setConvergenceTol(optimizer.getThreshold.get)}
else {LRModel.optimizer.setConvergenceTol(optimizer.getConvergenceTol)}
LRModel
}
Evaluate the model
def evaluteResult(result: RDD[(Double,Double,Double)]) :Unit = {
// MSE
val testMSE = result.map{ case(real, pre, _) => math.pow(real - pre, 2)}.mean()
println(s"Test Mean Squared Error = $testMSE")
// AUC
val metrics = new BinaryClassificationMetrics(result.map(x => (x._2,x._1)).sortByKey(ascending = true),numBins = 2)
println(s"0-1 label AUC is = ${metrics.areaUnderROC}")
val metrics1 = new BinaryClassificationMetrics(result.map(x => (x._3,x._1)).sortByKey(ascending = true),numBins = 2)
println(s"score-label AUC is = ${metrics1.areaUnderROC}")
// Error rate
val error = result.filter(x => x._1!=x._2).count().toDouble / result.count()
println(s"error is = $error")
// Accuracy rate
val accuracy = result.filter(x => x._1==x._2).count().toDouble / result.count()
println(s"accuracy is = $accuracy")
}
Save the model
def saveModel(model: LogisticRegressionModel, model_path: String): Unit = {
// Save model file obj
val out_obj = new ObjectOutputStream(new FileOutputStream(model_path+"model.obj"))
out_obj.writeObject(model)
// Save model information
val model_info=new BufferedWriter(new FileWriter(model_path+"model_info.txt"))
model_info.write(model.toString())
model_info.flush()
model_info.close()
// Save model weights
val model_weights=new BufferedWriter(new FileWriter(model_path+"model_weights.txt"))
model_weights.write(model.weights.toString)
model_weights.flush()
model_weights.close()
println(s"Model information written to files under: $model_path")
}
Load model
def loadModel(model_path: String): Option[LogisticRegressionModel] = {
try{
val in = new ObjectInputStream( new FileInputStream(model_path) )
val model = Option( in.readObject().asInstanceOf[LogisticRegressionModel] )
in.close()
println("Model Load Success")
model
}
catch {
case ex: ClassNotFoundException =>
ex.printStackTrace()
None
case ex: IOException =>
ex.printStackTrace()
None
case ex: Throwable => throw ex
}
}
Score samples with the loaded model
// Load the obj model file for prediction
val model_new = loadModel(s"$model_path/model.obj")
// Predict test samples with the loaded model
val result_new = test.map(line => {
val pre_label = model_new.get.predict(line.features)
// blas.ddot(n, x, 1, y, 1) computes the dot product of vectors x and y of length n (stride 1)
val pre_score = blas.ddot(model_new.get.numFeatures, line.features.toArray, 1, model_new.get.weights.toArray, 1)
// Convert the raw margin into a probability with the sigmoid function
val score = 1.0 / (1.0 + Math.exp(-pre_score))
(line.label, pre_label, score)
})
result_new.take(2).foreach(println)
Binary LR in ml
The LR in the ml package can be used for binary as well as multi-class classification:

binary classification: binomial logistic regression
multi-class classification: multinomial logistic regression

Binary classification can be implemented with either binomial or multinomial logistic regression.
An LR implementation based on binomial logistic regression:
def BinaryModel(train: Dataset[Row], model_path: String, spark: SparkSession) = {
// Create the model
val LRModel = new LogisticRegression()
.setMaxIter(20)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Train and evaluate the model
val model = LRModel.fit(train)
evalute(model, train, spark)
}
def evalute(model: LogisticRegressionModel, train: Dataset[Row], spark: SparkSession):Unit = {
// Print model parameters
println(s"Model parameters:\n ${model.parent.explainParams()} \n")
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
// Inspect predictions on the training set. rawPrediction: raw score; probability: probability after the sigmoid transform
val result = model.evaluate(train)
result.predictions.show(10)
// Extract the label, the probability of class 0 and the predicted label
result.predictions.select("label","probability","prediction").rdd
.map(row => (row.getDouble(0),row.get(1).asInstanceOf[DenseVector].toArray(0),row.getDouble(2)))
.take(10).foreach(println)
// Evaluate the model
val trainSummary = model.summary
val objectiveHistory = trainSummary.objectiveHistory
println("objectiveHistoryLoss:")
objectiveHistory.foreach(loss => println(loss))
val binarySummary = trainSummary.asInstanceOf[BinaryLogisticRegressionSummary]
val roc = binarySummary.roc
roc.show()
println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
// Set the model threshold to maximize the F-Measure
val fMeasure = binarySummary.fMeasureByThreshold
fMeasure.show(10)
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
import spark.implicits._
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).select("threshold").head().getDouble(0)
model.setThreshold(bestThreshold)
}
An LR implementation based on multinomial logistic regression:
def BinaryModelWithMulti(train: Dataset[Row], model_path: String, spark: SparkSession) = {
// Create the model
val LRModel = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFamily("multinomial")
// Train the model
val model = LRModel.fit(train)
// Print model parameters
println(s"Model parameters:\n ${model.parent.explainParams()} \n")
println(s"Coefficients: ${model.coefficientMatrix}")
println(s"Intercept: ${model.interceptVector}")
}
Multi-class LR in ml
The probability that a sample belongs to class k is computed as:

P(y=k|x) = exp(βₖᵀx) / Σ_{k'=0}^{K-1} exp(βₖ'ᵀx)

where K is the number of classes and βₖ is the weight vector of class k over the J features. The weights are estimated by maximum likelihood, i.e. by minimizing the negative log-likelihood (with an optional elastic-net penalty, as set in the code below), and the update formula follows from its gradient.
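The class-probability formula above is just a softmax over per-class linear scores. A minimal sketch (the weight matrix, intercepts and input are made-up numbers, not values learned by Spark):

```python
import numpy as np

# Weight matrix: K classes x J features, plus per-class intercepts (toy values)
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([0.5, 1.0])

logits = W @ x + b                     # per-class linear scores beta_k^T x
probs = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs /= probs.sum()                   # normalize: this is the softmax
print(probs.round(3), probs.argmax())  # probabilities sum to 1; class 1 wins here
```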
The dataset used is in LIBSVM format, for example:
1 1:0.222222 2:0.5 3:0.762712 4:0.833333
1 1:0.555556 2:0.25 3:0.864407 4:0.916667
1 1:0.722222 2:0.166667 3:0.864407 4:0.833333
1 1:0.722222 2:0.166667 3:0.694915 4:0.916667
0 1:0.166667 2:0.416667 3:0.457627 4:0.5
1 1:0.833333 3:0.864407 4:0.916667
2 1:1.32455e-07 2:0.166667 3:0.220339 4:0.0833333
2 1:1.32455e-07 2:0.333333 3:0.0169491 4:4.03573e-08
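These lines are in LIBSVM sparse format: a label followed by 1-based index:value pairs, where missing indices are implicit zeros (note the line labelled 1 with only three pairs, whose feature 2 is absent). A minimal Python sketch of the parsing, independent of Spark's MLUtils loader:

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM-format line: '<label> <idx>:<value> ...'.
    Indices are 1-based; indices not listed are implicit zeros."""
    parts = line.strip().split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx) - 1] = float(value)
    return label, features

label, feats = parse_libsvm_line("1 1:0.833333 3:0.864407 4:0.916667", 4)
print(label, feats)  # feature 2 stays 0.0
```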
The multi-class LR model implementation:
def MultiModel(file_multi: String, spark: SparkSession, model_path: String): Unit = {
val training = spark.read.format("libsvm").load(file_multi)
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
}
— End —