Spark Recommendation in Practice: Two Ways to Implement LR and an Introduction to Multiclass LR

Search and recommend wikis 2021-10-14 07:57:14


This article covers the following topics:

  • Regression analysis

    • What is regression analysis

    • Classification of regression analysis algorithms

  • Introduction to logistic regression

    • The Sigmoid function

    • Why LR uses the Sigmoid function

    • The principle of the LR algorithm

  • LRWithLBFGS in mllib

  • Binary classification LR in ml

  • Multiclass LR in ml

Logistic regression (Logistic Regression, LR) was applied to recommendation ranking early on. It is a linear model: the model is simple and can take in massive numbers of discrete features, which lets it account for finer details and individual factors. Introducing nonlinear factors requires feature crossing, which easily produces on the order of ten billion features. For a long time, CTR work relied mainly on manual feature engineering to keep improving results.

Although few scenarios or businesses in industry still rank with LR alone, LR remains a good choice for beginners and for small and medium-sized companies that are short on resources and people. They can go online with it first and then iterate to improve the ranking model.

Before studying LR, let's first look at what regression analysis is and how it is classified.

Regression analysis

Regression analysis algorithms (Regression Analysis Algorithm) are among the most common algorithms in machine learning.

What is regression analysis

Regression analysis uses samples (known data) to produce a fitted equation, which is then used to make predictions (for unknown data). For example, given a set of random variables $X$ and another set of random variables $Y$, the statistical method for studying the relationship between $X$ and $Y$ is called regression analysis. Because $X$ and $Y$ here correspond one to one, this is univariate linear regression.

Regression analysis algorithm classification

Regression analysis algorithms are divided into linear regression algorithms and nonlinear regression algorithms.

1、 Linear regression

Linear regression can be divided into univariate linear regression and multivariate linear regression. In linear regression every independent variable has exponent 1. "Linear" here does not strictly mean fitting the data with a line; the fit can also be a two-dimensional plane, a three-dimensional hyperplane, and so on.

Univariate linear regression: regression with only one independent variable. For example, consider the relationship between the size of a house (Area) and its total price (Money): as the area grows, the price grows too. The only independent variable here is the area, so this is univariate linear regression.

Multiple linear regression: regression with two or more independent variables. For example, consider the relationship between house size (Area), floor (floor) and house price (Money). There are two independent variables here, so this is bivariate linear regression.
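As a rough illustration of univariate linear regression, the sketch below fits price = w * area + b with the closed-form least-squares solution in plain Python. The data points and the helper name `fit_line` are made up for this example and do not come from the article.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    # The fitted line passes through the point of means
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: house area and total price in arbitrary units
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

w, b = fit_line(areas, prices)
print(w, b)
```

With more than one independent variable (area and floor, say) the same idea applies, but the solution is usually computed with a linear-algebra library rather than by hand.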

A typical linear regression equation has the form:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

In the statistical sense, a regression equation is linear if it is linear with respect to its parameters. As long as it is linear in the parameters, the regression model is still linear even when the sample variables enter as quadratic or higher-order features. For example:

$$y = w_0 + w_1 x + w_2 x^2$$

Features can even be transformed with logarithms or exponentials, for example:

$$y = w_0 + w_1 \ln x + w_2 e^{x}$$

2、 Non-linear regression

In one class of models the regression parameters are not linear and cannot be made linear by any transformation. Such models are called nonlinear regression models. Nonlinear regression can be divided into univariate and multivariate regression, and at least one independent variable has an exponent other than 1. In regression analysis, when the causal relationship under study involves one dependent variable and one independent variable, it is called univariate regression analysis; when it involves one dependent variable and two or more independent variables, it is called multiple regression analysis.

For example, regression equations such as the following two:

$$y = \theta_1 + \theta_2 e^{\theta_3 x}$$

$$y = \theta_1 x^{\theta_2} + \theta_3$$

Unlike in a linear regression model, a feature in these nonlinear regression models corresponds to more than one parameter.

3、 Generalized linear regression

Some nonlinear regression problems can still be analyzed with linear regression methods; these are called generalized linear regression. The typical representative is Logistic Regression.

It is only mentioned here and is covered in detail in the sections that follow.

Introduction to logistic regression

Logistic regression and linear regression are essentially the same: both solve for the optimal coefficients through an error function, and in form logistic regression simply applies a logistic function on top of linear regression. Compared with linear regression, logistic regression (Logistic Regression, LR) is better suited to models with a binary dependent variable, and its regression coefficients can be used to estimate the relative weight of each independent variable in the model.

Sigmoid function

The Sigmoid function is a smooth approximation of the Heaviside step function. In binary classification, where the labels are 0 and 1, its output lies between 0 and 1. Its mathematical expression is:

$$g(x) = \frac{1}{1 + e^{-x}}$$

The following code plots the Sigmoid function.

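import math
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# In Python 2, range returns a list; in Python 3 it returns a lazy range object, so convert it with list
X = list(range(-10, 10))
Y = list(map(sigmoid, X))
fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111)

# Hide the top and right spines
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')

# Move the other two axes so that they cross at the origin
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))

ax.plot(X, Y)
plt.show()

Figure: the Sigmoid function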

As the plot shows, the Sigmoid function $g(x)$ is continuous, smooth and strictly monotonic, and is centrally symmetric about (0, 0.5), which makes it a very good threshold function. As $x$ approaches negative infinity, $g(x)$ tends to 0; as $x$ approaches positive infinity, $g(x)$ tends to 1; at $x = 0$, $g(x) = 0.5$. Outside the range [-6, 6] the function value barely changes and is already very close to 0 or 1, so that region is usually ignored in applications.

The range of the Sigmoid function is limited to (0, 1), which matches the [0, 1] range of probability values, so the Sigmoid function can be tied to a probability distribution.

The derivative of the Sigmoid function can be written in terms of the function itself, namely $g'(x) = g(x)(1 - g(x))$, which makes it convenient and cheap to compute. The derivation is:

$$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = g(x)\,(1 - g(x))$$
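The identity g'(x) = g(x)(1 - g(x)) can be checked numerically. The sketch below compares the analytic derivative with a central finite difference; the helper names `sigmoid_grad` and `numeric_grad` are introduced only for this illustration.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_grad(x):
    # Analytic derivative, expressed through the function value itself
    s = sigmoid(x)
    return s * (1 - s)


def numeric_grad(f, x, eps=1e-6):
    # Central finite difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)


for x in [-6.0, -1.0, 0.0, 1.0, 6.0]:
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The two columns should agree to several decimal places, and the derivative peaks at x = 0, where it equals 0.25.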

Why LR uses the Sigmoid function

Only the binary case is discussed here. LR makes a single assumption: the features of the two classes follow Gaussian distributions with unequal means and equal variance, that is:

$$p(x \mid y=0) \sim N(\mu_0, \sigma^2), \qquad p(x \mid y=1) \sim N(\mu_1, \sigma^2)$$

Why assume a Gaussian distribution? On the one hand, the Gaussian distribution is easy to understand. On the other hand, from the viewpoint of information theory, when the mean and variance are known, the Gaussian distribution is the distribution with maximum entropy. (The exact mean and variance are not known, but according to probability theory, when the sample is large enough the sample mean and variance converge to the true mean and variance with probability 1.) Why maximum entropy? Because the maximum-entropy distribution spreads the risk evenly. Think of binary search: why pick the midpoint as the probe every time? To spread the risk. The equal-variance assumption is for convenience later on: if the variances were unequal, the quadratic terms would not cancel.

First define the "risk" as:

$$R(y=0 \mid x) = \lambda_{00} P(y=0 \mid x) + \lambda_{01} P(y=1 \mid x)$$

$$R(y=1 \mid x) = \lambda_{10} P(y=0 \mid x) + \lambda_{11} P(y=1 \mid x)$$

Here $R(y=0 \mid x)$ is the risk of predicting the sample as 0, $R(y=1 \mid x)$ is the risk of predicting the sample as 1, and $\lambda_{ij}$ is the risk of predicting $i$ for a sample whose actual label is $j$.

In the LR algorithm a correct prediction is considered to carry no risk, so $\lambda_{00}$ and $\lambda_{11}$ are both 0. In addition, a label of 0 predicted as 1 and a label of 1 predicted as 0 are considered equally risky, so $\lambda_{01}$ and $\lambda_{10}$ are equal. For convenience, denote both by $\lambda$.

The "risk" defined above then simplifies to:

$$R(y=0 \mid x) = \lambda P(y=1 \mid x), \qquad R(y=1 \mid x) = \lambda P(y=0 \mid x)$$

Now the question is: should a sample be predicted as 0 or as 1? By the principle of risk minimization, choose the option with the smaller risk. That is, when $R(y=0 \mid x) < R(y=1 \mid x)$, predicting 0 carries less risk than predicting 1; equivalently, when $P(y=1 \mid x) < P(y=0 \mid x)$, the sample should be predicted as 0. In other words: compare the two conditional probabilities and assign the sample to the class with the larger probability.

Dividing both sides of $P(y=1 \mid x) < P(y=0 \mid x)$ by $P(y=0 \mid x)$ gives:

$$\frac{P(y=1 \mid x)}{P(y=0 \mid x)} < 1$$

Take the logarithm of the left-hand side (why the logarithm? As mentioned above, the features of the two classes follow Gaussians with unequal means and equal variance, and the logarithm makes the exponent in the Gaussian easy to handle), then expand with Bayes' formula. The normalization constant $p(x)$ cancels, which gives:

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=0)\,p(y=0)} = \log \frac{p(x \mid y=1)}{p(x \mid y=0)} + \log \frac{p(y=1)}{p(y=0)}$$

For convenience, assume $x$ is one-dimensional (it extends easily to the multi-dimensional case) and substitute the Gaussian density. Since $p(y=1)$ and $p(y=0)$ are constants, abbreviate the second term as the constant $C_1$. Expanding further gives:

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = -\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_0)^2}{2\sigma^2} + C_1 = \frac{\mu_1-\mu_0}{\sigma^2}\,x + \frac{\mu_0^2-\mu_1^2}{2\sigma^2} + C_1$$

Let:

$$w = \frac{\mu_1-\mu_0}{\sigma^2}, \qquad b = \frac{\mu_0^2-\mu_1^2}{2\sigma^2} + C_1$$

so that the log-odds equal $wx + b$. Exponentiating both sides and simplifying with the identity $P(y=1 \mid x) + P(y=0 \mid x) = 1$ gives:

$$P(y=1 \mid x) = \frac{1}{1 + e^{-(wx+b)}}$$

The calculation is:

$$\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)} = e^{wx+b} \;\Rightarrow\; P(y=1 \mid x) = \frac{e^{wx+b}}{1 + e^{wx+b}} = \frac{1}{1 + e^{-(wx+b)}}$$

This is why the LR algorithm uses the Sigmoid function.
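The argument can be checked numerically. Under the stated assumption (one-dimensional Gaussian class-conditional densities with different means and a shared standard deviation), the posterior from Bayes' rule should coincide with a Sigmoid applied to a linear function of x. The sketch below is only an illustration; the function names and the numeric values for the means, standard deviation and class prior are made up.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def posterior_bayes(x, mu0, mu1, sigma, prior1):
    # P(y=1 | x) computed directly with Bayes' rule
    p1 = gaussian_pdf(x, mu1, sigma) * prior1
    p0 = gaussian_pdf(x, mu0, sigma) * (1 - prior1)
    return p1 / (p0 + p1)


def posterior_sigmoid(x, mu0, mu1, sigma, prior1):
    # The same posterior written as sigmoid(w * x + b)
    w = (mu1 - mu0) / sigma ** 2
    b = (mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2) + math.log(prior1 / (1 - prior1))
    return 1 / (1 + math.exp(-(w * x + b)))


# Hypothetical parameters: class means 0 and 2, shared sigma 1.5, P(y=1) = 0.3
for x in [-3.0, 0.0, 0.7, 2.5]:
    print(x, posterior_bayes(x, 0.0, 2.0, 1.5, 0.3), posterior_sigmoid(x, 0.0, 2.0, 1.5, 0.3))
```

Both columns should match up to floating-point error, which is the one-dimensional version of the statement that the log-odds are linear in x.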

The principle of the LR algorithm

1、 Algorithm principle

A machine learning model in effect restricts the decision function to a certain set of conditions, and this set of constraints determines the model's hypothesis space. Naturally, we also want this set of constraints to be simple and reasonable.

The assumption made by the logistic regression model is:

$$P(y=1 \mid x; \theta) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Here $g$ is the Sigmoid function. The corresponding decision function is:

$$y^* = 1 \quad \text{if } P(y=1 \mid x) > 0.5$$

Choosing 0.5 as the threshold is the common practice, but in real applications a different threshold can be chosen for the situation at hand. If precision on positive examples matters more, raise the threshold; if recall on positive examples matters more, lower it.
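A minimal sketch of the decision function with an adjustable threshold follows. The weights, bias and sample are hypothetical values chosen for the illustration, and the bias is kept separate from the weight vector for readability.

```python
import math


def predict_proba(weights, bias, features):
    # Linear score followed by the Sigmoid function
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    # Predict 1 only when the estimated probability exceeds the threshold
    return 1 if predict_proba(weights, bias, features) > threshold else 0


# Hypothetical model and sample
weights, bias = [1.0, -2.0], 0.5
sample = [2.0, 0.5]

print(predict_proba(weights, bias, sample))
print(predict_label(weights, bias, sample))                 # default threshold 0.5
print(predict_label(weights, bias, sample, threshold=0.9))  # stricter threshold
```

Raising the threshold turns borderline positives into negatives, which trades recall for precision on the positive class.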

Once the mathematical form of the model is fixed, the next step is to solve for its parameters. A method commonly used in statistics is maximum likelihood estimation: find the set of parameters under which the likelihood (probability) of the data is largest. In logistic regression, the likelihood function can be written as:

$$L(\theta) = \prod_{i=1}^{N} g(\theta^T x_i)^{y_i}\,\bigl(1 - g(\theta^T x_i)\bigr)^{1-y_i}$$

where $g(z) = 1/(1 + e^{-z})$ is the Sigmoid function. Taking the logarithm gives the log-likelihood:

$$l(\theta) = \sum_{i=1}^{N} \Bigl[ y_i \log g(\theta^T x_i) + (1-y_i) \log\bigl(1 - g(\theta^T x_i)\bigr) \Bigr]$$

A loss function is likewise used to measure how accurate the model's predictions are. Here the log loss is used. On a single sample with predicted probability $p = g(\theta^T x)$ it is defined as:

$$\text{cost}(y, p) = -y \log p - (1-y) \log(1 - p)$$

Averaging the loss over the whole dataset gives:

$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log g(\theta^T x_i) + (1-y_i) \log\bigl(1 - g(\theta^T x_i)\bigr) \Bigr]$$

In the logistic regression model, maximizing the likelihood function and minimizing the log loss are equivalent. There are many ways to solve this optimization problem; gradient descent is used here as the example. The basic steps are:

  • (1) Choose the descent direction (the negative gradient direction: $-\nabla J(\theta)$);

  • (2) Choose the step size $\alpha$ and update the parameters ($\theta^{t+1} = \theta^{t} - \alpha \nabla J(\theta^{t})$);

  • (3) Repeat the two steps above until the termination condition is met.

The gradient of the loss function is:

$$\nabla J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl(g(\theta^T x_i) - y_i\bigr)\,x_i$$

Taking a sufficiently small step along the negative gradient direction guarantees that the loss decreases. Moreover, the loss function of logistic regression is convex (strictly convex once an L2 regularization term is added), so any local optimum found is also the global optimum.
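The three steps can be sketched in plain Python. This is a toy batch gradient descent on the average log loss, not the optimizer Spark uses; the data, step size and iteration count are arbitrary choices for the illustration, and the first feature is a constant 1 that plays the role of the intercept.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def log_loss(theta, xs, ys):
    # Average log loss over the dataset
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * v for t, v in zip(theta, x)))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def gradient(theta, xs, ys):
    # Gradient of the average log loss: mean of (prediction - label) * features
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = sigmoid(sum(t * v for t, v in zip(theta, x))) - y
        for j, v in enumerate(x):
            grad[j] += err * v
    return [g / len(xs) for g in grad]


def fit(xs, ys, step=0.5, iterations=200):
    theta = [0.0] * len(xs[0])
    for _ in range(iterations):
        grad = gradient(theta, xs, ys)                        # (1) descent direction
        theta = [t - step * g for t, g in zip(theta, grad)]   # (2) parameter update
    return theta                                              # (3) stop after a fixed number of iterations


# Toy data: [constant 1, feature value] with labels 0 / 1
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]

theta = fit(xs, ys)
print(theta, log_loss(theta, xs, ys))
```

A fixed iteration count is the simplest termination condition; a tolerance on the change in loss is a common alternative.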

2、 Regularization

When a model has too many parameters it overfits easily, and the model's complexity needs to be controlled. The most common approach is to add a regularization term to the objective, which prevents overfitting by penalizing overly large parameters.

Common regularization methods include L1 regularization and L2 regularization. With $L(w)$ the loss and $\lambda$ the regularization coefficient, they correspond to the following two objectives:

$$\min_w L(w) + \lambda \|w\|_1, \qquad \min_w L(w) + \lambda \|w\|_2^2$$

  • L1 regularization uses the sum of the absolute values of the elements of the weight vector $w$, usually written $\|w\|_1$.

  • L2 regularization uses the square root of the sum of the squares of the elements of the weight vector $w$ (note that the L2 term in Ridge regression carries a square, $\|w\|_2^2$), usually written $\|w\|_2$.
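For a concrete feel of the two norms, the sketch below computes them for a small weight vector and adds the penalty to a given loss value. The helper names and the convention of halving the squared L2 term are choices made for this illustration.

```python
import math


def l1_norm(w):
    # Sum of absolute values
    return sum(abs(v) for v in w)


def l2_norm(w):
    # Square root of the sum of squares
    return math.sqrt(sum(v * v for v in w))


def penalized_loss(base_loss, w, reg_param, kind="l2"):
    # L1 adds reg_param * ||w||_1; the L2 variant here adds reg_param / 2 * ||w||_2^2
    if kind == "l1":
        return base_loss + reg_param * l1_norm(w)
    return base_loss + 0.5 * reg_param * l2_norm(w) ** 2


w = [3.0, -4.0]
print(l1_norm(w), l2_norm(w))
print(penalized_loss(1.0, w, 0.1, kind="l1"), penalized_loss(1.0, w, 0.1, kind="l2"))
```

L1 tends to push individual weights to exactly zero, while L2 shrinks all weights smoothly, which is why L1 is often used for feature selection.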

LRWithLBFGS in mllib

The spark.mllib package provides two LR classification models:

  • mini-batch gradient descent (LogisticRegressionWithSGD)
  • L-BFGS (LogisticRegressionWithLBFGS)

The official recommendation is L-BFGS, because LR based on L-BFGS converges faster than LR based on SGD. The original wording is:

We implemented two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS. We recommend L-BFGS over mini-batch gradient descent for faster convergence.

In addition, LRWithLBFGS supports both binary and multiclass classification, while LRWithSGD supports only binary classification. So only the operations related to LogisticRegressionWithLBFGS in Spark mllib are introduced here.

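Set variables and create the spark object

import org.apache.spark.sql.SparkSession

val file = "data/sample_libsvm_data.txt"
val model_path = "model/lr/"
val model_param = "numInterations:5,regParam:0.1,updater:SquaredL2Updater,gradient:LogisticGradient"

val spark = SparkSession.builder()
  .appName("LRWithLBFGS") // the application name is an assumption
  .getOrCreate()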

Split the dataset

import org.apache.spark.mllib.util.MLUtils

// Load the dataset and split it into a training set and a test set
val data = MLUtils.loadLibSVMFile(spark.sparkContext, file).randomSplit(Array(0.7, 0.3))
val (train, test) = (data(0), data(1))

LRWithLBFGS model parameters

// Number of classes, default 2; a LogisticRegression parameter
private var numClass: Int = 2
// Whether to add an intercept, default false; a LogisticRegression parameter
private var isAddIntercept: Option[Boolean] = None
// Whether to validate the data before training; a LogisticRegression parameter
private var isValidateData: Option[Boolean] = None

// Number of iterations, default 100; an LBFGS parameter
private var numInterations: Option[Int] = None
// Regularization coefficient, default 0.0; an LBFGS parameter
private var regParam: Option[Double] = None
// Regularization updater; supports L1Updater [L1], SquaredL2Updater [L2], SimpleUpdater [no regularization]; an LBFGS parameter
private var updater: Option[String] = None
// Gradient computation; supports LogisticGradient, LeastSquaresGradient, HingeGradient; an LBFGS parameter
private var gradient: Option[String] = None
// User-defined convergence threshold
private var threshold: Option[Double] = None
// Model convergence tolerance, default 10^-6
private var convergenceTol: Double = 1.0e-6

Creating the model

def createLRModel(model_param: String): LogisticRegressionWithLBFGS = {
  // Set model parameters
  val optimizer = new LROptimizer()
  println(s"The model training parameters are: ${optimizer.toString}")

  // Create the model and specify the relevant parameters
  val LRModel = new LogisticRegressionWithLBFGS()
  // Set the number of classes (getter name assumed, following the pattern of the other getters)
  LRModel.setNumClasses(optimizer.getNumClass)
  // Set whether to add an intercept
  if (optimizer.getIsAddIntercept.nonEmpty) { LRModel.setIntercept(optimizer.getIsAddIntercept.get) }
  // Set whether to validate the data (getter name assumed)
  if (optimizer.getIsValidateData.nonEmpty) { LRModel.setValidateData(optimizer.getIsValidateData.get) }
  // Set the number of iterations (getter name assumed)
  if (optimizer.getNumInterations.nonEmpty) { LRModel.optimizer.setNumIterations(optimizer.getNumInterations.get) }
  // Set the regularization coefficient
  if (optimizer.getRegParam.nonEmpty) { LRModel.optimizer.setRegParam(optimizer.getRegParam.get) }
  // Set the regularization updater
  optimizer.getUpdater match {
    case Some("L1Updater") => LRModel.optimizer.setUpdater(new L1Updater())
    case Some("SquaredL2Updater") => LRModel.optimizer.setUpdater(new SquaredL2Updater())
    case Some("SimpleUpdater") => LRModel.optimizer.setUpdater(new SimpleUpdater())
    case _ => LRModel.optimizer.setUpdater(new SquaredL2Updater())
  }
  // Set the gradient calculation method
  optimizer.getGradient match {
    case Some("LogisticGradient") => LRModel.optimizer.setGradient(new LogisticGradient())
    case Some("LeastSquaresGradient") => LRModel.optimizer.setGradient(new LeastSquaresGradient())
    case Some("HingeGradient") => LRModel.optimizer.setGradient(new HingeGradient())
    case _ => LRModel.optimizer.setGradient(new LogisticGradient())
  }
  // Set the convergence tolerance
  if (optimizer.getThreshold.nonEmpty) { LRModel.optimizer.setConvergenceTol(optimizer.getThreshold.get) }
  else { LRModel.optimizer.setConvergenceTol(optimizer.getConvergenceTol) }

  LRModel
}


Model evaluation

 // result holds (real label, predicted label, predicted score) per sample
 def evaluteResult(result: RDD[(Double, Double, Double)]): Unit = {
  // MSE between the real label and the predicted label
  val testMSE = result.map { case (real, pre, _) => math.pow(real - pre, 2) }.mean()
  println(s"Test Mean Squared Error = $testMSE")
  // AUC
  val metrics = new BinaryClassificationMetrics(result.map(x => (x._2, x._1)).sortByKey(ascending = true), numBins = 2)
  println(s"0-1 label AUC is = ${metrics.areaUnderROC}")
  val metrics1 = new BinaryClassificationMetrics(result.map(x => (x._3, x._1)).sortByKey(ascending = true), numBins = 2)
  println(s"score-label AUC is = ${metrics1.areaUnderROC}")
  // Error rate
  val error = result.filter(x => x._1 != x._2).count().toDouble / result.count()
  println(s"error is = $error")
  // Accuracy
  val accuracy = result.filter(x => x._1 == x._2).count().toDouble / result.count()
  println(s"accuracy is = $accuracy")
 }
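To make the metrics concrete: AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, and accuracy is the share of samples whose predicted label matches the real label. The plain-Python sketch below follows those definitions directly; it is an illustration on made-up numbers and is not how Spark's BinaryClassificationMetrics computes the curve.

```python
def auc(scores, labels):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(predicted, labels):
    # Share of samples whose predicted label equals the real label
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)


# Hypothetical scores, predicted labels and real labels
scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
predicted = [1, 1, 0, 1]

print(auc(scores, labels))
print(accuracy(predicted, labels))
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples like this one.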

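Save the model

 def saveModel(model: LogisticRegressionModel, model_path: String): Unit = {
  // Save the model object as an obj file
  val out_obj = new ObjectOutputStream(new FileOutputStream(model_path + "model.obj"))
  out_obj.writeObject(model)
  out_obj.close()

  // Save model information (the exact content written is an assumption)
  val model_info = new BufferedWriter(new FileWriter(model_path + "model_info.txt"))
  model_info.write(model.toString())
  model_info.close()

  // Save model weights (the output format is an assumption)
  val model_weights = new BufferedWriter(new FileWriter(model_path + "model_weights.txt"))
  model_weights.write(model.weights.toArray.mkString(" "))
  model_weights.close()

  println(s"The model information has been written to files under: $model_path")
 }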

Load the model

 def loadModel(model_path: String): Option[LogisticRegressionModel] = {
  try {
   val in = new ObjectInputStream(new FileInputStream(model_path))
   val model = Option(in.readObject().asInstanceOf[LogisticRegressionModel])
   in.close()
   println("Model Load Success")
   model
  } catch {
   case ex: ClassNotFoundException => ex.printStackTrace(); None
   case ex: IOException => ex.printStackTrace(); None
   case _: Throwable => throw new Exception
  }
 }

Score samples with the loaded model

 // Load the obj file for prediction
 val model_new = loadModel(s"$model_path/model.obj")
 val lrModel = model_new.get
 // Predict the test samples with the loaded model
 // blas is assumed to be a netlib BLAS instance, e.g. com.github.fommil.netlib.BLAS.getInstance()
 val result_new = test.map(line => {
  val pre_label = lrModel.predict(line.features)
  // blas.ddot(n, x, incx, y, incy): dot product of x and y over n elements, with index increments incx and incy
  val pre_score = blas.ddot(lrModel.numFeatures, line.features.toArray, 1, lrModel.weights.toArray, 1)
  // Sigmoid of the linear score
  val score = 1.0 / (1.0 + math.exp(-pre_score))
  (line.label, pre_label, score)
 })

Binary classification LR in ml

LR in the ml package can be used for binary classification as well as multiclass classification.

  • Binary classification corresponds to: binomial logistic regression
  • Multiclass classification corresponds to: multinomial logistic regression

Binary classification can be implemented with either binomial logistic regression or multinomial logistic regression.

LR implementation based on binomial logistic regression:

import org.apache.spark.sql.functions.max

def BinaryModel(train: Dataset[Row], model_path: String, spark: SparkSession) = {
 // Create the model
 val LRModel = new LogisticRegression()
 // Train and evaluate the model
 val model = LRModel.fit(train)
 evalute(model, train, spark)
}

def evalute(model: LogisticRegressionModel, train: Dataset[Row], spark: SparkSession): Unit = {
  // Print model parameters
  println(s"The model parameter information is as follows:\n ${model.parent.explainParams()} \n")
  println(s"Coefficients: ${model.coefficients}")
  println(s"Intercept: ${model.intercept}")
  // Predict on the training set. rawPrediction: raw score per row; probability: score after the sigmoid transform
  val result = model.evaluate(train)
  // Extract the label, the probability of class 0 and the predicted label
  val predictions = result.predictions.select("label", "probability", "prediction").rdd
   .map(row => (row.getDouble(0), row.get(1).asInstanceOf[DenseVector].toArray(0), row.getDouble(2)))
  // Evaluate the model
  val trainSummary = model.summary
  val objectiveHistory = trainSummary.objectiveHistory
  objectiveHistory.foreach(loss => println(loss))

  val binarySummary = trainSummary.asInstanceOf[BinaryLogisticRegressionSummary]

  val roc = binarySummary.roc
  println(s"areaUnderROC: ${binarySummary.areaUnderROC}")

  // Set the model threshold to maximize F-Measure
  val fMeasure = binarySummary.fMeasureByThreshold
  val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
  import spark.implicits._
  val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).select("threshold").head().getDouble(0)
  model.setThreshold(bestThreshold)
}

LR implementation based on multinomial logistic regression:

def BinaryModelWithMulti(train: Dataset[Row], model_path: String, spark: SparkSession) = {
 // Create the model (the multinomial family setting is assumed from the heading)
 val LRModel = new LogisticRegression()
  .setFamily("multinomial")
 // Train the model
 val model = LRModel.fit(train)
 // Print model parameters
 println(s"The model parameter information is as follows:\n ${model.parent.explainParams()} \n")
 println(s"Coefficients: ${model.coefficientMatrix}")
 println(s"Intercept: ${model.interceptVector}")
}

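Multiclass LR in ml

The probability that a sample belongs to class $k$ is computed as:

$$P(Y=k \mid X, \beta_k, \beta_{0k}) = \frac{e^{\beta_k \cdot X + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\beta_{k'} \cdot X + \beta_{0k'}}}$$

where $K$ is the number of classes and $J$ is the number of features, so the coefficients $\beta$ form a $K \times J$ matrix.

The weights are obtained by maximum likelihood, that is, by minimizing the weighted negative log-likelihood with an elastic-net penalty. The objective is:

$$\min_{\beta, \beta_0} -\left[\sum_{i=1}^{L} w_i \cdot \log P(Y=y_i \mid x_i)\right] + \lambda \left[\frac{1}{2}(1-\alpha)\|\beta\|_2^2 + \alpha \|\beta\|_1\right]$$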
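The class probability is a softmax over one linear score per class. The sketch below computes it in plain Python for three classes and four features; the coefficient matrix and intercepts are invented for the illustration and are not fitted values.

```python
import math


def softmax_probabilities(coefficients, intercepts, features):
    # One margin per class: dot(beta_k, x) + beta_0k
    margins = [sum(b * x for b, x in zip(beta_k, features)) + b0
               for beta_k, b0 in zip(coefficients, intercepts)]
    # Subtract the largest margin before exponentiating, for numerical stability
    m = max(margins)
    exps = [math.exp(v - m) for v in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical K x J coefficient matrix (K = 3 classes, J = 4 features) and intercepts
coefficients = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
intercepts = [0.0, 0.0, 0.0]
features = [-0.222222, 0.5, -0.762712, -0.833333]

probs = softmax_probabilities(coefficients, intercepts, features)
print(probs, probs.index(max(probs)))
```

The predicted class is the index of the largest probability, and the probabilities always sum to 1.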

The dataset used is in LIBSVM format (label followed by index:value pairs), for example:

1 1:-0.222222 2:0.5 3:-0.762712 4:-0.833333
1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
1 1:-0.722222 2:-0.166667 3:-0.864407 4:-0.833333
1 1:-0.722222 2:0.166667 3:-0.694915 4:-0.916667
0 1:0.166667 2:-0.416667 3:0.457627 4:0.5
1 1:-0.833333 3:-0.864407 4:-0.916667
2 1:-1.32455e-07 2:-0.166667 3:0.220339 4:0.0833333
2 1:-1.32455e-07 2:-0.333333 3:0.0169491 4:-4.03573e-08
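Each line holds a label followed by sparse `index:value` pairs, and indices that are absent (feature 2 in the sixth row above) are implicitly zero. A small parser sketch makes the layout explicit; `parse_libsvm_line` is a name introduced for this example, and in Spark the loading is done by the libsvm reader instead.

```python
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        # Each item is "index:value" with a 1-based feature index
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 1:-0.833333 3:-0.864407 4:-0.916667")
print(label, features)
```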

The multiclass LR model is implemented as:

def MultiModel(file_multi: String, spark: SparkSession, model_path: String): Unit = {
 val training = spark.read.format("libsvm").load(file_multi)
 val lr = new LogisticRegression()

 // Fit the model
 val lrModel = lr.fit(training)

 // Print the coefficients and intercept for multinomial logistic regression
 println(s"Coefficients: \n${lrModel.coefficientMatrix}")
 println(s"Intercepts: ${lrModel.interceptVector}")
}