# A Brief Look at Model Performance Evaluation Metrics

lc013 2021-09-15 10:44:14

Introduction to Machine Learning series (2): How to Build a Complete Machine Learning Project, Part 10!

The previous nine articles in the series:

• Introduction to Machine Learning series (2): How to Build a Complete Machine Learning Project (Part 1)
• Acquiring machine learning datasets and constructing the test set
• Feature engineering: data preprocessing (Part 1)
• Feature engineering: data preprocessing (Part 2)
• Feature engineering: feature scaling and feature encoding
• Feature engineering (final part)
• Summary and comparison of common machine learning algorithms (Part 1)
• Summary and comparison of common machine learning algorithms (Part 2)
• Summary and comparison of common machine learning algorithms (final part)

This series is approaching its end; the last part mainly covers model evaluation.

In machine learning, evaluating the model is very important. Only by choosing an evaluation method that matches the problem can we quickly find problems in the algorithm, the model, or the training process and improve the model iteratively.

Model evaluation is divided into two stages, offline evaluation and online evaluation. Different types of machine learning problems, such as classification, regression, ranking, and sequence prediction, also call for different evaluation metrics.

The model evaluation part of the series covers the following topics:

• Performance metrics
• Model evaluation methods
• Generalization ability
• Overfitting and underfitting
• Hyperparameter tuning

This article introduces performance metrics first, mainly those for classification and regression problems, including:

• Accuracy and error rate
• Precision, recall, and F1
• ROC curve and AUC
• Cost matrix
• Performance metrics for regression problems
• Other evaluation criteria, such as computation speed and robustness

### 1. Performance metrics

Performance metrics are the evaluation criteria used to measure a model's generalization ability.

#### 1.1 Accuracy and error rate

Accuracy and error rate are the two most commonly used performance metrics for classification problems.

Accuracy is the proportion of correctly classified samples among all samples:

$$Accuracy = \frac{n_{correct}}{N}$$

Error rate is the proportion of misclassified samples among all samples:

$$Error = \frac{n_{error}}{N}$$

The error rate is also the error measured when the loss function is the 0-1 loss.

These two are the simplest and most intuitive metrics for classification problems. But they share a weakness: when the classes are imbalanced, neither can effectively evaluate the generalization ability of the model. For example, if 99% of the samples are negative, a model that predicts every sample as negative reaches 99% accuracy.

That is, when the classes are imbalanced, the class with the largest share tends to dominate the accuracy.

One remedy is to switch to another metric, for example the more informative average accuracy (the arithmetic mean of the per-class accuracies):

$$A_{mean}=\frac{a_1+a_2+\dots+a_m}{m}$$

where $m$ is the number of classes and $a_i$ is the accuracy on the samples of class $i$.
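
The imbalance problem and the average accuracy can be illustrated with a minimal sketch. The helper name `mean_class_accuracy` and the 99-to-1 data are assumptions made for illustration; they do not come from the original article.

```python
from collections import defaultdict


def mean_class_accuracy(y_true, y_pred):
    """Arithmetic mean of the accuracy computed separately on each true class."""
    correct = defaultdict(int)  # correctly classified samples per class
    total = defaultdict(int)    # total samples per class
    for y, y_p in zip(y_true, y_pred):
        total[y] += 1
        correct[y] += int(y == y_p)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


# Hypothetical data: 99 negative samples, 1 positive sample,
# and a model that predicts every sample as negative
y_true = [0] * 99 + [1]
y_pred = [0] * 100
plain_accuracy = sum(y == y_p for y, y_p in zip(y_true, y_pred)) / len(y_true)
print('accuracy=', plain_accuracy)
print('mean class accuracy=', mean_class_accuracy(y_true, y_pred))
```

Plain accuracy comes out at 0.99, while the average accuracy is only 0.5, because the accuracy on the positive class is 0.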

Accuracy and error rate can be implemented in Python as follows:

    def accuracy(y_true, y_pred):
        # Fraction of samples whose predicted label equals the true label
        return sum(y == y_p for y, y_p in zip(y_true, y_pred)) / len(y_true)

    def error(y_true, y_pred):
        # Fraction of samples whose predicted label differs from the true label
        return sum(y != y_p for y, y_p in zip(y_true, y_pred)) / len(y_true)


1.
2.
3.
4.
5.


A simple binary classification example:

    y_true = [1, 0, 1, 0, 1]
    y_pred = [0, 0, 1, 1, 0]
    acc = accuracy(y_true, y_pred)
    err = error(y_true, y_pred)
    print('accuracy=', acc)
    print('error=', err)




The output is:

    accuracy= 0.4
    error= 0.6




#### 1.2 Precision, recall, P-R curve, and F1

##### 1.2.1 Precision and recall

Precision is the proportion of true positives among all samples predicted to be positive:

$$P = \frac{TP}{TP+FP}$$

Recall is the proportion of all actual positives that the classifier finds:

$$R = \frac{TP}{TP+FN}$$

The symbols in these two formulas are defined as follows. In a binary classification problem, we treat the class of interest as the positive class and the other class as the negative class, so that:

• TP (True Positive): the number of true positives, that is, samples classified as positive that are actually positive;
• FP (False Positive): the number of false positives, that is, samples classified as positive that are actually negative;
• FN (False Negative): the number of false negatives, that is, samples classified as negative that are actually positive;
• TN (True Negative): the number of true negatives, that is, samples classified as negative that are actually negative.

This is easier to see in the table below, which is also the definition of the confusion matrix:

| | Predicted: positive class | Predicted: negative class |
|---|---|---|
| Actual: positive class | TP | FN |
| Actual: negative class | FP | TN |
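
The four cells of the confusion matrix can be counted directly from the labels. The following is a small sketch under the assumption that labels are 0/1 with 1 as the positive class; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, where 1 marks the positive class."""
    tp = sum(y == 1 and y_p == 1 for y, y_p in zip(y_true, y_pred))
    fp = sum(y == 0 and y_p == 1 for y, y_p in zip(y_true, y_pred))
    fn = sum(y == 1 and y_p == 0 for y, y_p in zip(y_true, y_pred))
    tn = sum(y == 0 and y_p == 0 for y, y_p in zip(y_true, y_pred))
    return tp, fp, fn, tn


y_true = [1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print('TP=', tp, 'FP=', fp, 'FN=', fn, 'TN=', tn)
print('precision=', tp / (tp + fp), 'recall=', tp / (tp + fn))
```

For this data the counts are TP=1, FP=1, FN=2, TN=1, which gives a precision of 0.5 and a recall of 1/3.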

Precision and recall are a pair of conflicting measures: when precision is high, recall tends to be low, and when recall is high, precision tends to be low. The reasons are as follows:

• For precision to be high, the share of true positives among the samples predicted positive must be high, and the usual way to achieve that is to predict as positive only the samples the classifier is sure about. In the extreme case only the single most confident sample is selected; if it is indeed positive, then FP=0 and P=1, but FN is very large (every sample the classifier is unsure about is judged negative), so recall is very low;
• For recall to be high, all positive samples must be found, and the simplest way to do that is to judge every sample as positive. Then FN=0, but FP is very large, so precision is very low.

Different problems also emphasize different metrics, for example:

• Recommender systems focus on precision. We want the recommended results to be ones the user is interested in, that is, the share of interesting items should be high. The display window is usually limited to perhaps 5 or 10 items, so it matters more that what is recommended is truly of interest;
• Medical diagnosis systems focus on recall. We want to avoid missing any patient who has the disease, because a missed detection may delay treatment and let the disease worsen.

A simple implementation of precision and recall for the binary case (labels are 0 and 1) is as follows:

    def precision(y_true, y_pred):
        # TP: samples that are positive in both the labels and the predictions
        true_positive = sum(y and y_p for y, y_p in zip(y_true, y_pred))
        predicted_positive = sum(y_pred)  # TP + FP
        return true_positive / predicted_positive

    def recall(y_true, y_pred):
        true_positive = sum(y and y_p for y, y_p in zip(y_true, y_pred))
        real_positive = sum(y_true)  # TP + FN
        return true_positive / real_positive




A simple test example and its output:

    y_true = [1, 0, 1, 0, 1]
    y_pred = [0, 0, 1, 1, 0]
    precisions = precision(y_true, y_pred)
    recalls = recall(y_true, y_pred)
    print('precisions=', precisions)  # Output is 0.5
    print('recalls=', recalls)  # Output is 0.3333



##### 1.2.2 P-R curve and F1

Often we can sort the samples by the classifier's prediction: the samples at the front are the ones the classifier considers most likely to be positive, and those at the end are the ones it considers least likely to be positive.

In general, this prediction is the classifier's confidence that a sample belongs to a class, and we can choose different thresholds to turn that confidence into an output label. For example, with a threshold of 0.9, only samples whose confidence is greater than or equal to 0.9 are judged positive; the rest are judged negative.

Different thresholds naturally give different numbers of predicted positives and negatives. Computing precision and recall for each threshold in turn, and plotting precision on the vertical axis against recall on the horizontal axis, gives the "P-R curve", as shown in the figure below:

Of course, the curve above is idealized, drawn that way for convenience and appearance. A real curve looks like the figure below:

The P-R curve has the following properties:

1. The curve runs from the upper left corner (0,1) toward the lower right corner (1,0), which reflects that precision and recall are conflicting measures, one high when the other is low:

• At the start precision is high, because the threshold is set very high: only the first sample (the one the classifier is most sure is positive) is predicted positive and all others are predicted negative. Precision is therefore high, almost 1, while recall is almost 0, because only one positive has been found.
• At the lower right, recall is very high and precision is very low. Here the threshold is 0, so every sample is predicted positive: all positives are found, so recall is very high, but precision is very low because a large number of negatives are predicted positive.

2. The P-R curve shows directly the precision and recall of the classifier over the whole sample set, so two classifiers can be compared by their P-R curves on the same test set:

• If the P-R curve of classifier B is completely enclosed by the curve of classifier A, as in the left figure below, we can say that A performs better than B;
• If the two curves cross, as in the right figure below, it is hard to say directly which classifier is better, and they can only be compared at specific precision and recall values:
• One reasonable criterion is to compare the area under the P-R curve, which to some extent represents how often the classifier achieves high precision and high recall at the same time, but this value is not easy to compute;
• Another is the break-even point (BEP), the value at which precision equals recall, as shown in the right figure below; the curve whose break-even point lies farther from the origin is judged the better one.
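
The points of a P-R curve and an approximate break-even point can be computed with a short sketch like the one below. The function name `pr_points`, the labels, and the scores are all hypothetical, and the sketch assumes 0/1 labels with at least one positive sample.

```python
def pr_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [int(s >= threshold) for s in scores]
        tp = sum(y and y_p for y, y_p in zip(y_true, y_pred))
        points.append((threshold, tp / sum(y_pred), tp / sum(y_true)))
    return points


# Hypothetical labels and confidence scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = pr_points(y_true, scores)
for threshold, p, r in points:
    print(threshold, round(p, 3), round(r, 3))

# Approximate break-even point: the threshold where precision and recall are closest
bep = min(points, key=lambda item: abs(item[1] - item[2]))
print('approximate BEP:', bep)
```

With this data, precision starts at 1 with recall 1/3 at the highest threshold and ends at 0.5 with recall 1 at the lowest; precision and recall coincide at threshold 0.7, where both equal 2/3.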

Of course, the break-even point is still too simplistic, which is why the F1 score is used as a further criterion. It is the harmonic mean of precision and recall, defined as:

$$F1 = \frac{2 \times P \times R}{P+R}=\frac{2\times TP}{N+TP-TN}$$

where $N$ is the total number of samples. F1 has a more general form, $F_{\beta}$, which lets us express different preferences for precision and recall:

$$F_{\beta}=\frac{(1+\beta^2)\times P\times R}{(\beta^2 \times P)+R}$$

where $\beta > 0$ measures the importance of recall relative to precision. When $\beta = 1$ this is F1; if $\beta > 1$, recall matters more; if $\beta < 1$, precision matters more.
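
A minimal sketch of the general formula follows; the function name `f_beta` and the zero-division convention are assumptions, and the inputs are the precision (0.5) and recall (1/3) from the test example in section 1.2.1.

```python
def f_beta(p, r, beta=1.0):
    """F-beta score from precision p and recall r; beta=1 gives F1."""
    if p == 0 and r == 0:
        return 0.0  # convention assumed here to avoid dividing by zero
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


p, r = 0.5, 1 / 3  # precision and recall from the earlier test example
print('F1  =', f_beta(p, r))
print('F2  =', f_beta(p, r, beta=2))    # weights recall more, so it moves toward r
print('F0.5=', f_beta(p, r, beta=0.5))  # weights precision more, so it moves toward p
```

Here recall is lower than precision, so F2 falls below F1 and F0.5 rises above it. The F1 value of 0.4 also agrees with the second form of the formula: with TP=1, TN=1, and 5 samples, 2×1/(5+1−1) = 0.4.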

##### 1.2.3 Macro-precision / micro-precision, macro-recall / micro-recall, and macro-F1 / micro-F1

Often we end up with more than one binary confusion matrix: for example, from training and testing several times, from training and testing on several datasets to estimate the "global" performance of an algorithm, or from a multi-class task in which the classes are paired off and each pairing gives its own confusion matrix.

In short, we want to assess precision and recall jointly over $n$ binary confusion matrices. There are generally two ways to do so:

1. The first computes precision and recall on each confusion matrix, giving $(P_1, R_1), (P_2, R_2), \cdots, (P_n, R_n)$, and then averages them to obtain the macro-precision (macro-P), macro-recall (macro-R), and macro-F1:

$$macro\text{-}P = \frac{1}{n}\sum_{i=1}^n P_i$$

$$macro\text{-}R = \frac{1}{n}\sum_{i=1}^n R_i$$

$$macro\text{-}F1 = \frac{2\times macro\text{-}P\times macro\text{-}R}{macro\text{-}P+macro\text{-}R}$$

2. The second averages the corresponding elements of the confusion matrices to obtain the mean values $\overline{TP}$, $\overline{FP}$, $\overline{TN}$, $\overline{FN}$, and computes the micro-precision (micro-P), micro-recall (micro-R), and micro-F1 from these means:

$$micro\text{-}P = \frac{\overline{TP}}{\overline{TP}+\overline{FP}}$$

$$micro\text{-}R = \frac{\overline{TP}}{\overline{TP}+\overline{FN}}$$

$$micro\text{-}F1 = \frac{2\times micro\text{-}P\times micro\text{-}R}{micro\text{-}P + micro\text{-}R}$$
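
The difference between the two averaging schemes is easiest to see on confusion matrices of very different size. The sketch below follows the definitions above; the function name `macro_micro`, the tuple layout (TP, FP, FN, TN), and the two matrices are assumptions made for illustration.

```python
def macro_micro(matrices):
    """matrices: list of (TP, FP, FN, TN) tuples, one per binary confusion matrix."""
    n = len(matrices)
    # Macro: compute P and R on each matrix, then average them
    macro_p = sum(tp / (tp + fp) for tp, fp, fn, tn in matrices) / n
    macro_r = sum(tp / (tp + fn) for tp, fp, fn, tn in matrices) / n
    # Micro: average the matrix entries first, then compute P and R
    mean_tp = sum(m[0] for m in matrices) / n
    mean_fp = sum(m[1] for m in matrices) / n
    mean_fn = sum(m[2] for m in matrices) / n
    micro_p = mean_tp / (mean_tp + mean_fp)
    micro_r = mean_tp / (mean_tp + mean_fn)

    def f1(p, r):
        return 2 * p * r / (p + r)

    return {
        'macro': (macro_p, macro_r, f1(macro_p, macro_r)),
        'micro': (micro_p, micro_r, f1(micro_p, micro_r)),
    }


# Two hypothetical confusion matrices of very different size
matrices = [(90, 10, 10, 890), (1, 1, 3, 95)]
result = macro_micro(matrices)
print('macro (P, R, F1):', result['macro'])
print('micro (P, R, F1):', result['micro'])
```

The macro values treat both matrices equally, so the weak second matrix pulls macro-P down to 0.7. The micro values are dominated by the matrix with more positive samples, so micro-P stays close to 0.9.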

#### 1.3 ROC and AUC

##### 1.3.1 ROC curve

ROC curve is short for Receiver Operating Characteristic curve (its Chinese name translates literally as "subject operating characteristic" curve). It originated in the military field and has since been widely used in medicine.

Its horizontal axis is the false positive rate (FPR) and its vertical axis is the true positive rate (TPR), defined as:

$$TPR = \frac{TP}{TP+FN}$$

$$FPR = \frac{FP}{FP+TN}$$

TPR is the probability that a positive sample is predicted as positive by the classifier, which is exactly the recall of the positive class.

FPR is the probability that a negative sample is predicted as positive by the classifier. It equals 1 minus the recall of the negative class. The recall of the negative class, given below, is called the true negative rate (TNR), also known as specificity, and is the proportion of negative samples that are classified correctly:

$$TNR =\frac{TN}{FP+TN}$$

As with the P-R curve, the ROC curve is drawn by repeatedly adjusting the threshold that separates positive from negative predictions; its vertical axis is TPR and its horizontal axis is FPR. The following walk-through follows 《Baimian Machine Learning》. Start from the table shown in the figure below, which is an example of the output of a binary classification model. It contains 20 samples with their true labels, where p denotes the positive class and n the negative class, and the model's output probability is its confidence that the sample is positive.

At first, if the threshold is set to infinity, the model judges every sample as negative, so TP and FP are both 0, TPR and FPR are both 0, and the first point of the ROC curve is (0, 0). Next, set the threshold to 0.9. Now sample number 1 is judged positive, and it really is positive, so TP = 1. There are 10 positive samples, so TPR = 0.1. No negative sample is wrongly predicted as positive, so FP = 0 and FPR = 0, and the second point of the curve is (0, 0.1).

By continuing to adjust the threshold we obtain further points of the curve, and finally the ROC curve shown in the figure below.

There is also a second, more intuitive way to draw the ROC curve. First count the positive and negative samples, say P and N. Set the tick spacing of the horizontal axis to 1/N and that of the vertical axis to 1/P. Then sort the samples by the model's output probability from high to low and traverse them in that order, drawing the ROC curve from the origin: each time a positive sample is encountered, draw one tick of curve along the vertical axis, and each time a negative sample is encountered, draw one tick along the horizontal axis. When all samples have been traversed the curve ends at the point (1,1), and the ROC curve is complete.
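
The second drawing method translates almost directly into code. The following is a sketch, not a reference implementation: the function name `roc_by_steps` is hypothetical, labels are assumed to be 0/1, scores are assumed to have no ties, and points are returned as (FPR, TPR), that is, in (horizontal, vertical) order.

```python
def roc_by_steps(y_true, scores):
    """Trace the ROC curve by walking the samples from highest to lowest score.

    Returns a list of (FPR, TPR) points. Assumes labels are 0/1 and scores have no ties.
    """
    num_pos = sum(y_true)
    num_neg = len(y_true) - num_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fp = tp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1  # positive sample: one step of 1/P up the vertical axis
        else:
            fp += 1  # negative sample: one step of 1/N along the horizontal axis
        points.append((fp / num_neg, tp / num_pos))
    return points


y_true = [1, 0, 1, 0, 1]
y_hat_prob = [0.9, 0.85, 0.8, 0.7, 0.6]
print(roc_by_steps(y_true, y_hat_prob))
```

With 3 positives and 2 negatives, each positive moves the curve up by 1/3 and each negative moves it right by 1/2, and the walk ends at (1.0, 1.0).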

Of course, a more typical ROC curve looks like the figure below and is smoother; the curve in the figure above is jagged only because the number of samples is limited.

The ROC curve has the following properties:

1. The ROC curve usually starts at the lower left corner (0,0) and ends at the upper right corner (1,1).

• At the start, the first sample is predicted positive and all others are predicted negative;
• TPR is very low, almost 0 (0.1 in the example above), because most positives have not yet been found by the classifier;
• FPR is very low, possibly 0 (0 in the example above), because the samples predicted positive at this stage are very likely to be truly positive, so few negatives are wrongly predicted as positive.
• At the end, all samples are predicted positive.
• TPR is almost 1, because when every sample is predicted positive all positive samples are necessarily found;
• FPR is also almost 1, because all negative samples are misjudged as positive.

2. On the ROC plot:

• The diagonal corresponds to a random-guess model, that is, a probability of 0.5;
• The point (0,1) is the ideal model, because there TPR=1 and FPR=0: all positives are predicted correctly and no negative is wrongly predicted as positive;
• In general, the closer the ROC curve is to the point (0, 1), the better.

3. The ROC curve can also be used to compare two classifiers:

• If the ROC curve of classifier A is completely enclosed by the curve of classifier B, then B performs better than A, which matches the earlier statement that the closer the ROC curve is to the point (0, 1), the better;
• If the ROC curves of the two classifiers cross, it is again hard to judge directly which is better, and we need the area under the ROC curve, called AUC (Area Under ROC Curve).

A simple implementation is as follows:

    def true_negative_rate(y_true, y_pred):
        # A sample counts as a true negative only when label and prediction are both 0
        true_negative = sum(1 - (yi or yi_hat) for yi, yi_hat in zip(y_true, y_pred))
        actual_negative = len(y_true) - sum(y_true)
        return true_negative / actual_negative

    def roc(y, y_hat_prob):
        # Use every distinct predicted probability as a threshold, from high to low
        thresholds = sorted(set(y_hat_prob), reverse=True)
        ret = [[0, 0]]
        for threshold in thresholds:
            y_hat = [int(yi_hat_prob >= threshold) for yi_hat_prob in y_hat_prob]
            # Each point is stored as [TPR, FPR]; recall() is defined in section 1.2.1
            ret.append([recall(y, y_hat), 1 - true_negative_rate(y, y_hat)])
        return ret




A simple test example:

    y_true = [1, 0, 1, 0, 1]
    y_hat_prob = [0.9, 0.85, 0.8, 0.7, 0.6]
    roc_list = roc(y_true, y_hat_prob)
    print('roc_list:', roc_list)
    # The output is roc_list: [[0, 0], [0.3333333333333333, 0.0], [0.3333333333333333, 0.5], [0.6666666666666666, 0.5], [0.6666666666666666, 1.0], [1.0, 1.0]]



##### 1.3.2 Comparing ROC and P-R curves

Similarities

1. Both describe how the choice of threshold affects the classification metrics. Although a classifier outputs a probability, that is, a confidence, for every sample, we usually set a threshold by hand that determines the classifier's final decision, for example a high threshold such as 0.95 or a lower one such as 0.3.

• If precision matters more, raise the threshold so that only samples the classifier is sure about are judged positive; the threshold can then be set to 0.9 or higher;
• If recall matters more, lower the threshold so that more samples are judged positive and it is easier to find all the truly positive samples; the threshold can then be set to 0.5 or lower.

2. Each point on either curve corresponds to one choice of threshold and gives the (precision, recall) or (TPR, FPR) obtained at that threshold. Moving along the horizontal axis corresponds to lowering the threshold.

Differences

Compared with the P-R curve, the ROC curve has one notable property: when the distribution of positive and negative samples changes, its shape stays essentially the same, as shown in the figure below:

The figure compares how the P-R and ROC curves change after the number of negative samples is increased tenfold. The shape of the ROC curve stays essentially the same, while the P-R curve changes dramatically.
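
The effect of a tenfold increase in negatives can be reproduced on a toy example. In this sketch the helper `rates`, the labels, the scores, and the threshold of 0.5 are all hypothetical; the negatives are simply repeated ten times with their scores unchanged, and the threshold is assumed to leave at least one sample predicted positive.

```python
def rates(y_true, scores, threshold):
    """Return (TPR, FPR, precision) at one threshold for 0/1 labels."""
    y_pred = [int(s >= threshold) for s in scores]
    tp = sum(y and y_p for y, y_p in zip(y_true, y_pred))
    fp = sum((not y) and y_p for y, y_p in zip(y_true, y_pred))
    num_pos = sum(y_true)
    num_neg = len(y_true) - num_pos
    return tp / num_pos, fp / num_neg, tp / (tp + fp)


# Hypothetical labels and scores, for illustration only
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3]

# Repeat every negative sample ten times, keeping its score
y_big, s_big = [], []
for y, s in zip(y_true, scores):
    copies = 1 if y == 1 else 10
    y_big += [y] * copies
    s_big += [s] * copies

before = rates(y_true, scores, threshold=0.5)
after = rates(y_big, s_big, threshold=0.5)
print('before (TPR, FPR, precision):', before)
print('after  (TPR, FPR, precision):', after)
```

TPR and FPR, the two coordinates of the ROC curve, are identical before and after, while precision drops from 2/3 to 1/6, which is why the P-R curve moves so much.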

This property of the ROC curve reduces the interference caused by differences between test sets and gives a more objective evaluation of the model itself, so it suits more scenarios, such as ranking, recommendation, and advertising.

It also matters because many real-world problems have imbalanced numbers of positive and negative samples. In computational advertising, for example, conversion-rate models are common, and the number of positive samples is often one thousandth of the number of negative samples or even less. In such cases the ROC curve better reflects the quality of the model itself.

Of course, if you want to see how the model behaves on a particular dataset, the P-R curve reflects its performance more directly. The choice still depends on the specific problem.

##### 1.3.3 AUC

AUC is the area under the ROC curve. Its meaning is as follows: pick a sample at random from all positive samples, and let $p_1$ be the probability with which the model predicts it to be positive; pick a sample at random from all negative samples, and let $p_0$ be the probability with which the model predicts it to be positive. AUC is the probability that $p_1 > p_0$.

AUC has the following properties:

• If the samples are classified completely at random, the probability that $p_1 > p_0$ is 0.5, so AUC=0.5;

• AUC remains applicable when the samples are imbalanced.

For example, in an anti-fraud scenario, suppose normal users are the positive class (say 99.9% of users) and fraudulent users are the negative class (say 0.1%).

If accuracy is used for evaluation, predicting every user as positive gives 99.9% accuracy. This is clearly not a good prediction, because none of the fraudulent users is found.

If AUC is used for evaluation, then FPR=1 and TPR=1, and the corresponding AUC is 0.5. AUC therefore correctly indicates that this is not a good prediction.

• AUC reflects the model's ability to rank samples (ranking by the predicted probability of being positive). For example, AUC=0.8 means that, given one positive sample and one negative sample, in 80% of cases the model assigns the positive sample a higher probability than the negative sample.

• AUC is insensitive to uniform sampling. For example, in the anti-fraud scenario above, suppose the normal users are uniformly downsampled. For any given negative sample n, let Pn be the probability with which the model predicts it to be positive. Because the sampling is uniform, the proportions of true positive samples whose predicted probability is greater than Pn and less than Pn are the same before and after downsampling, so AUC does not change.

With non-uniform downsampling, however, the proportions of true positive samples whose predicted probability is greater than Pn and less than Pn do change, which can change the AUC.

• The larger the gap between the predicted positive-class probabilities of positive and negative samples, the higher the AUC, because a larger gap means the ranking between positive and negative samples is more certain and the two are easier to tell apart.

For example, in e-commerce, the AUC of a click-through-rate model is lower than that of a purchase-conversion model. Because clicking costs the user less than purchasing, the difference between positive and negative samples is smaller in the click-through-rate model than in the purchase-conversion model.
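
The ranking interpretation of AUC can be checked by comparing every positive sample with every negative sample. This is a sketch under assumptions: the function name `auc_by_ranking` is hypothetical, labels are 0/1, and ties are counted as half a pair, which is a common convention rather than something stated in the article.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive sample scores higher.

    Ties count as half a pair, a common convention that is assumed here.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0, 1]
y_hat_prob = [0.9, 0.85, 0.8, 0.7, 0.6]
print('auc by ranking:', auc_by_ranking(y_true, y_hat_prob))
```

Of the 6 positive-negative pairs in this data, the positive sample has the higher score in 3, so the result is 0.5.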

AUC can be computed by summing the areas of the individual pieces under the ROC curve. Suppose the ROC curve is formed by connecting, in order, the points

$$\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}, \quad \text{where } x_1=0,\ x_m=1$$

Then AUC can be estimated as:

$$AUC = \frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_i)\times (y_i+y_{i+1})$$

The code implementation is as follows:

    def get_auc(y, y_hat_prob):
        # roc() returns [TPR, FPR] points; sum the trapezoids between neighbouring points
        roc_val = iter(roc(y, y_hat_prob))
        tpr_pre, fpr_pre = next(roc_val)
        auc = 0
        for tpr, fpr in roc_val:
            auc += (tpr + tpr_pre) * (fpr - fpr_pre) / 2
            tpr_pre = tpr
            fpr_pre = fpr
        return auc




A simple test example:

    y_true = [1, 0, 1, 0, 1]
    y_hat_prob = [0.9, 0.85, 0.8, 0.7, 0.6]
    auc_val = get_auc(y_true, y_hat_prob)
    print('auc_val:', auc_val)  # The output is 0.5




#### 1.4 Cost matrix

The performance metrics described so far share an implicit premise: all errors have equal cost. In practice, different types of errors have different consequences. For example, judging a healthy person to be a patient and judging a patient to be healthy certainly carry different costs: the former may only require another examination, while the latter may miss the best time for treatment.

Therefore, to weigh the different losses caused by different types of errors, the errors can be assigned unequal costs.

For a binary classification problem we can define a cost matrix in which $cost_{ij}$ is the cost of predicting a sample of class $i$ as class $j$, and the cost of a correct prediction is 0, as shown in the following table:

| | Predicted: class 0 | Predicted: class 1 |
|---|---|---|
| Actual: class 0 | 0 | $cost_{01}$ |
| Actual: class 1 | $cost_{10}$ | 0 |

1. Under unequal costs, we no longer look for a model that simply minimizes the error rate, but for one that minimizes the total cost.

2. Under unequal costs, the ROC curve cannot directly reflect the expected total cost of a classifier, and a cost curve is needed instead.

• The horizontal axis of the cost curve is the positive-example probability cost, given below, where $p$ is the probability that a sample is a positive example (class 0):

$$P_{+cost} = \frac{p\times cost_{01}}{p\times cost_{01}+(1-p)\times cost_{10}}$$

• The vertical axis of the cost curve is the normalized cost:

$$cost_{norm} = \frac{FNR\times p\times cost_{01}+FPR\times (1-p)\times cost_{10}}{p\times cost_{01}+(1-p)\times cost_{10}}$$

Here the false positive rate FPR is the probability that the model predicts a negative sample as positive:

$$FPR = \frac{FP}{TN+FP}$$

and the false negative rate FNR is the probability that the model predicts a positive sample as negative:

$$FNR = 1 - TPR = \frac{FN}{TP+FN}$$

The cost curve is shown in the figure below:
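
The total cost and one point of the cost curve can be computed as in the sketch below. The function names, the cost values, and the two sets of predictions are hypothetical; as in the text, class 0 is treated as the positive class, so `cost_01` is the cost of missing a positive.

```python
def total_cost(y_true, y_pred, cost):
    """Sum cost[i][j] over all samples, where i is the true class and j the predicted class."""
    return sum(cost[y][y_p] for y, y_p in zip(y_true, y_pred))


def cost_curve_point(fnr, fpr, p, cost_01, cost_10):
    """Return (positive-example probability cost, normalized cost) for one classifier."""
    denominator = p * cost_01 + (1 - p) * cost_10
    p_plus_cost = p * cost_01 / denominator
    cost_norm = (fnr * p * cost_01 + fpr * (1 - p) * cost_10) / denominator
    return p_plus_cost, cost_norm


# Hypothetical costs: class 0 is the positive class, and missing it (cost_01) is 10x worse
cost = [[0, 10],
        [1, 0]]
y_true = [0, 0, 1, 1, 1]
model_a = [0, 1, 1, 1, 1]  # one positive (class 0) predicted as class 1
model_b = [0, 0, 0, 0, 1]  # two negatives (class 1) predicted as class 0
print('model A: errors=1, total cost=', total_cost(y_true, model_a, cost))
print('model B: errors=2, total cost=', total_cost(y_true, model_b, cost))
# Model A has FNR=0.5 and FPR=0, and 2 of the 5 samples are positive (p=0.4)
print(cost_curve_point(fnr=0.5, fpr=0.0, p=0.4, cost_01=10, cost_10=1))
```

Model A makes fewer errors but has a total cost of 10, while model B makes more errors at a total cost of 2, so under these costs the model with the higher error rate is the better choice.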

#### 1.5 Performance metrics for regression problems

For regression problems, the common performance metrics are:

1. Mean squared error (MSE):

$$MSE=\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y_i})^2$$

2. Root mean squared error (RMSE):

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y_i})^2}$$

3. Root mean squared logarithmic error (RMSLE):

$$RMSLE=\sqrt{\frac{1}{N}\sum_{i=1}^N[\log(y_i+1)- \log(\hat{y_i}+1)]^2}$$

4. Mean absolute error (MAE):

$$MAE = \frac{1}{N}\sum_{i=1}^N |y_i-\hat{y_i}|$$

Of these four metrics, the first two, MSE and RMSE, generally reflect well how far the regression model's predictions deviate from the true values. But if there are a few outliers with very large deviations, even a small number of them makes these two metrics very poor.

There are three ways to deal with this:

• Treat the outliers as noise, and filter them out in the data preprocessing stage;
• Work on the model itself: improve its predictive ability by modeling the mechanism that produces the outliers, although this is more difficult;
• Use another metric, such as the third one, RMSLE, which focuses on the relative size of the prediction error and so reduces the influence of outliers even when they are present; or MAPE, the mean absolute percentage error, defined as:

$$MAPE = \sum_{i=1}^n \left|\frac{y_i-\hat{y_i}}{y_i}\right|\times\frac{100}{n}$$
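
The other regression metrics, and their different sensitivity to a single outlier, can be sketched in plain Python. The function names and the data are hypothetical; the sketch assumes all values are greater than -1 (for the logarithm) and that no true value is 0 (for MAPE).

```python
import math


def mse(y_true, y_pred):
    return sum((y - y_p) ** 2 for y, y_p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    # log1p(x) computes log(x + 1); values are assumed to be greater than -1
    squared = [(math.log1p(y) - math.log1p(y_p)) ** 2 for y, y_p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    return sum(abs(y - y_p) for y, y_p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # Undefined when a true value is 0
    return sum(abs((y - y_p) / y) for y, y_p in zip(y_true, y_pred)) * 100 / len(y_true)


# Hypothetical data: the last prediction is an outlier with a very large deviation
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 19.0, 31.0, 140.0]
print('MSE  =', mse(y_true, y_pred))
print('RMSE =', math.sqrt(mse(y_true, y_pred)))
print('MAE  =', mae(y_true, y_pred))
print('RMSLE=', rmsle(y_true, y_pred))
print('MAPE =', mape(y_true, y_pred))
```

Three of the four predictions are off by only 1, yet the single outlier drives MSE to 2500.75 and RMSE to about 50, while MAE is 25.75 and RMSLE, which works on the logarithmic scale, stays below 1.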

A simple implementation of RMSE is as follows:

    import numpy as np

    def rmse(predictions, targets):
        # Error between the predicted values and the true values (NumPy arrays)
        differences = predictions - targets
        differences_squared = differences ** 2
        mean_of_differences_squared = differences_squared.mean()
        # Square root of the mean squared error
        rmse_val = np.sqrt(mean_of_differences_squared)
        return rmse_val




#### 1.6 Other evaluation criteria

1. Computation speed: the time required for model training and prediction;
2. Robustness: the ability to handle missing values and outliers;
3. Scalability: the ability to handle large datasets;
4. Interpretability: how understandable the model's prediction criteria are. For example, the rules produced by a decision tree are easy to understand, whereas a neural network is called a black box because its large number of parameters is hard to interpret.

### Summary

This article introduced several performance metrics, mainly for classification problems and based on the binary case. They are all very common, and in practice these are the main methods used to evaluate model performance.

References:

• 《Machine Learning》, Zhou
• 《Baimian Machine Learning》
• 《hands-on-ml-with-sklearn-and-tf》
• 9. Model evaluation
• Methods for evaluating classification models and their Python implementation
