Paper: https://arxiv.org/abs/2109.03144
Project: https://github.com/PaddlePaddle/PaddleOCR
PP-OCRv2 improves on its predecessor in three main respects:
Model accuracy: an improvement of more than 7% over the PP-OCR mobile version;
Speed: an improvement of more than 220% over the PP-OCR server version;
Model size: at 11.6M in total, it is easy to deploy on both server and mobile.
To give readers more technical detail, the PaddlePaddle PaddleOCR team has prepared this in-depth interpretation of PP-OCRv2, in the hope that it helps with your work and study.
PP-OCRv2: an in-depth look at the five key technical improvements
The newly upgraded PP-OCRv2 keeps the same overall pipeline as PP-OCR, as shown in the figure below:
In terms of optimization strategy, it is optimized in depth from five angles (marked by the red boxes in the figure above), namely:
The detection part has two optimizations:
Collaborative Mutual Learning (CML) knowledge distillation strategy
CopyPaste data augmentation strategy
The recognition part has three optimizations:
LCNet lightweight backbone network (Lightweight CPU Network)
UDML knowledge distillation strategy
Enhanced CTC loss
Each is described in detail below.
Detection model optimization: CML knowledge distillation strategy
As shown in the figure below, standard distillation uses a large model as the Teacher to guide the Student model and improve its results. The later DML (deep mutual learning) method has two models with the same structure learn from each other. Both of these algorithms work between two models. CML, the collaborative mutual distillation method newly adopted in PP-OCRv2, works among three models: it contains two Student models of the same structure that learn from each other, and it also introduces a Teacher model with a larger structure.
In this way, the core idea of CML combines ① standard distillation, in which a conventional Teacher guides the Students, with ② DML mutual learning between the Student networks. The idea resembles a real classroom in which students are encouraged to discuss with each other while a teacher guides them, which lets the Students learn better.
The specific network structure and the three carefully designed key loss functions are shown in the figure below:
① DML Loss: an input training image is fed to each of the two Student networks (DBNet detection models are used here), which output the corresponding probability maps (response maps). The DML loss between the two networks' outputs is then computed using the KL divergence. The corresponding formula is as follows, where S1 and S2 denote the two Student networks and KL is the divergence:
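A minimal sketch of such a symmetric KL term, under stated assumptions: each pixel of a probability map is treated as a Bernoulli distribution, the two KL directions are averaged, and the result is averaged over pixels. The names `bernoulli_kl` and `dml_loss`, the `EPS` value and these normalization choices are illustrative and are not taken from PaddleOCR.

```python
import math

EPS = 1e-7  # keeps log() away from 0; the value is an assumption


def bernoulli_kl(p, q):
    """KL(p || q) between two Bernoulli distributions with parameters p and q."""
    p = min(max(p, EPS), 1.0 - EPS)
    q = min(max(q, EPS), 1.0 - EPS)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def dml_loss(s1_map, s2_map):
    """Symmetric KL between two probability maps (lists of rows), averaged over pixels."""
    total, count = 0.0, 0
    for row1, row2 in zip(s1_map, s2_map):
        for p, q in zip(row1, row2):
            # Average both directions so neither Student is privileged.
            total += 0.5 * (bernoulli_kl(p, q) + bernoulli_kl(q, p))
            count += 1
    return total / count


# Toy 2 x 2 probability maps standing in for the two Students' outputs.
s1 = [[0.9, 0.2], [0.1, 0.8]]
s2 = [[0.7, 0.3], [0.2, 0.6]]
print(dml_loss(s1, s2))
```

The loss is zero when the two Students agree exactly and grows as their probability maps diverge, which is what pushes them toward each other during training.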
② GT Loss: this is the standard DBNet training task, as shown in the figure below:
Its output mainly consists of the three feature maps above; the details are shown in the table below:
GT Loss can be expressed as:
③ Distill Loss: the third part is the supervisory signal from the Teacher. Distillation is applied to only two of the feature maps, the probability map and the binary map. In addition, practical experience shows that dilating the Teacher's output (f_dila()) strengthens the Teacher's representation ability and raises the Teacher's accuracy by about 2%, which in turn improves the distillation. The corresponding function can be expressed as:
Here the weighting hyperparameter defaults to 5, the two loss terms are the cross-entropy loss and the Dice loss, and f_dila() is the dilation function.
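A minimal sketch of how these pieces could fit together, using plain Python lists in place of tensors. Several details are assumptions rather than facts from the text: the 2 x 2 dilation window, the weight of 5 being applied to the cross-entropy term on the probability map, and all function names (`dilate`, `bce_loss`, `dice_loss`, `distill_loss`).

```python
import math


def dilate(prob_map, k=2):
    """Grey-scale dilation: each pixel becomes the max over a k x k window.

    The 2 x 2 window size is an assumption; the text only names f_dila().
    """
    h, w = len(prob_map), len(prob_map[0])
    return [[max(prob_map[rr][cc]
                 for rr in range(r, min(r + k, h))
                 for cc in range(c, min(c + k, w)))
             for c in range(w)] for r in range(h)]


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two maps with values in [0, 1]."""
    total, n = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            n += 1
    return total / n


def dice_loss(pred, target, eps=1e-7):
    """One minus the Dice coefficient between two maps."""
    inter = sum(p * t for prow, trow in zip(pred, target)
                for p, t in zip(prow, trow))
    union = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - 2.0 * inter / (union + eps)


def distill_loss(s_prob, s_bin, t_prob, t_bin, gamma=5.0):
    """Weighted cross-entropy on probability maps plus Dice on binary maps.

    Both Teacher maps are dilated before being used as targets.
    Which term gamma weights is an assumption of this sketch.
    """
    return (gamma * bce_loss(s_prob, dilate(t_prob))
            + dice_loss(s_bin, dilate(t_bin)))


# Toy Student (s_*) and Teacher (t_*) outputs.
s_prob = [[0.8, 0.6], [0.1, 0.2]]
t_prob = [[0.9, 0.1], [0.0, 0.1]]
s_bin = [[1.0, 1.0], [0.0, 0.0]]
t_bin = [[1.0, 0.0], [0.0, 0.0]]
print(distill_loss(s_prob, s_bin, t_prob, t_bin))
```

Dilation widens the Teacher's text regions slightly, so the Student is supervised with a more generous target around text boundaries.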
④ Finally, all of the loss functions are added together to give the final loss function, as shown below:
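Written out, with L_gt, L_dml and L_distill used here as assumed shorthand for the GT, DML and Distill losses described above, an unweighted sum reads:

```latex
L_{total} = L_{gt} + L_{dml} + L_{distill}
```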
The weights assigned to the three loss functions can also be tuned, or learned as hyperparameters; interested developers are welcome to experiment further.
Detection model optimization: CopyPaste data augmentation strategy
In actual detection model training, two problems always arise: ① insufficient sample richness, mainly because labeling large amounts of data is costly and because data collection itself must be rich and diverse; ② poor robustness of the model to the environment: for the same text distribution, detection results differ considerably across backgrounds.
It is therefore natural to turn to data augmentation as an important way to improve the model's generalization. CopyPaste is a novel data augmentation technique whose effectiveness has been verified in object detection and instance segmentation tasks. With CopyPaste, text instances can be synthesized to balance the ratio of positive to negative samples in the training images.
By contrast, traditional image rotation, random flipping and random cropping cannot do this. The main steps of CopyPaste are:
Randomly select two training images;
Apply random scale jittering;
Apply a random horizontal flip;
Randomly select a subset of the targets in one image;
Paste them at a random position in the other image.
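The steps above can be sketched in plain Python, with nested lists standing in for image arrays and boxes given as (x, y, width, height). This is an illustration under assumptions, not PaddleOCR's implementation: the helper names (`rescale`, `hflip`, `copy_paste`), the jitter range of 0.5 to 1.5, the 50% flip probability and the nearest-neighbour resize are all hypothetical choices. Step 1, picking the two images, would be a call such as `rng.sample(dataset, 2)`.

```python
import math
import random


def rescale(img, boxes, scale):
    """Nearest-neighbour resize of an image (list of pixel rows) and its boxes."""
    src_h, src_w = len(img), len(img[0])
    dst_h, dst_w = math.ceil(src_h * scale), math.ceil(src_w * scale)
    out = [[img[min(src_h - 1, int(r / scale))][min(src_w - 1, int(c / scale))]
            for c in range(dst_w)] for r in range(dst_h)]
    new_boxes = []
    for x, y, w, h in boxes:
        nx, ny = int(x * scale), int(y * scale)
        # Clip so every scaled box stays inside the resized image.
        nw = max(1, min(int(w * scale), dst_w - nx))
        nh = max(1, min(int(h * scale), dst_h - ny))
        new_boxes.append((nx, ny, nw, nh))
    return out, new_boxes


def hflip(img, boxes):
    """Flip an image and its boxes horizontally."""
    width = len(img[0])
    flipped = [row[::-1] for row in img]
    return flipped, [(width - x - w, y, w, h) for x, y, w, h in boxes]


def copy_paste(dst_img, dst_boxes, src_img, src_boxes, rng,
               scale_range=(0.5, 1.5)):
    """Paste a random subset of text boxes from src_img into a copy of dst_img."""
    out = [row[:] for row in dst_img]
    out_boxes = list(dst_boxes)
    if not src_boxes:
        return out, out_boxes
    # Step 2: random scale jitter on the donor image.
    src_img, src_boxes = rescale(src_img, src_boxes, rng.uniform(*scale_range))
    # Step 3: random horizontal flip.
    if rng.random() < 0.5:
        src_img, src_boxes = hflip(src_img, src_boxes)
    # Step 4: random subset of the donor's text instances.
    chosen = rng.sample(src_boxes, rng.randint(1, len(src_boxes)))
    dst_h, dst_w = len(out), len(out[0])
    for x, y, w, h in chosen:
        if w > dst_w or h > dst_h:
            continue  # patch does not fit into the destination image
        # Step 5: paste at a random position and record the new box.
        px, py = rng.randint(0, dst_w - w), rng.randint(0, dst_h - h)
        for r in range(h):
            out[py + r][px:px + w] = src_img[y + r][x:x + w]
        out_boxes.append((px, py, w, h))
    return out, out_boxes


rng = random.Random(0)
background = [[0] * 20 for _ in range(20)]  # empty 20 x 20 image
donor = [[9] * 10 for _ in range(10)]       # 10 x 10 image of "text" pixels
donor_boxes = [(1, 1, 4, 2), (5, 6, 3, 3)]
aug_img, aug_boxes = copy_paste(background, [], donor, donor_boxes, rng)
print(len(aug_boxes))
```

Returning the updated box list alongside the image matters for detection training, since every pasted text instance must also appear in the labels.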
This better improves sample richness and also increases the model's robustness to the environment. As shown in the figure below, the text cut out of the second image is randomly rotated, scaled and pasted into the first image, which further enriches the diversity of text against different backgrounds.
After the two detection optimizations above, the experimental results of the PP-OCRv2 detection part are shown in the table below:
As you can see, since distillation and data augmentation do not change the model structure, the overall model size and prediction speed are unchanged, but the accuracy, measured by Hmean, improves by 3.6%, which is quite significant.
Recognition model optimization: the self-developed LCNet lightweight backbone network
Here, the PP-OCRv2 R&D team proposed LCNet, a new backbone network improved from MobileNetV1. The major changes include:
Except in the SE modules, every ReLU in the network is replaced with h-swish, improving accuracy by 1%-2%;
In the fifth stage of LCNet, the kernel size of the DW (depthwise) convolution becomes 5x5, improving accuracy by 0.5%-1%;
SE modules are added to the last two depthSepconv blocks of LCNet's fifth stage, improving accuracy by 0.5%-1%;
A 1280-dimensional FC layer is added after GAP (global average pooling) to strengthen feature representation, improving accuracy by 2%-3%.
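For reference, h-swish is the "hard swish" activation popularized by MobileNetV3, defined as x * ReLU6(x + 3) / 6. A minimal scalar sketch (the function names are illustrative) shows how it differs from ReLU: it is zero for strongly negative inputs, equal to x for large positive inputs, and smooth-ish and slightly negative in between.

```python
def relu6(x):
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


# Compare ReLU and h-swish on a few sample inputs.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, max(x, 0.0), h_swish(x))
```

Because it only needs a clip, an add and a multiply, h-swish is cheap on CPUs, which fits LCNet's focus on CPU inference speed.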
Validation on the ImageNet-1k dataset shows that LCNet is not only far ahead of other lightweight backbone networks in accuracy but also has a clear advantage in CPU prediction speed.
Recognition model optimization: UDML knowledge distillation strategy
On top of standard DML knowledge distillation, a supervision mechanism on the feature map is introduced by adding a Feature Loss. This is combined with the basic CTC Loss of CRNN and the distillation DML Loss; the terms are computed separately and then summed. The total loss function can be expressed as:
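A minimal sketch of that sum, assuming the Feature Loss is a mean squared error between the two networks' (flattened) feature maps and that the three terms are added with equal weight. Both choices, and the names `feature_loss` and `udml_total_loss`, are assumptions of this sketch; the CTC and DML values are taken as already computed.

```python
def feature_loss(feat_a, feat_b):
    """Mean squared error between two flattened feature maps (assumed distance)."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def udml_total_loss(ctc_value, dml_value, feat_a, feat_b):
    """Unweighted sum of the CTC, DML and feature terms named in the text."""
    return ctc_value + dml_value + feature_loss(feat_a, feat_b)


# Toy numbers: precomputed CTC and DML losses plus two small feature vectors.
print(udml_total_loss(1.0, 0.5, [0.0, 0.0], [2.0, 2.0]))
```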
In addition, tricks such as increasing the number of training iterations and adding an FC network to the Head balance the model's feature encoding and decoding abilities and further improve the model's results.
Recognition model optimization: Enhanced CTC loss
A difficulty often encountered in Chinese OCR tasks is that there are many similar-looking characters, which are easy to misrecognize. Borrowing the idea of Metric Learning, Center Loss is therefore introduced to further increase the distance between classes. The core idea is shown in the formula below:
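As a minimal sketch of a center-loss term added to CTC: each time-step feature xt is pulled toward the center of its class yt, which tightens each class cluster and so helps keep similar characters apart. The weight `lam=0.05`, the squared Euclidean distance and the function names are assumptions of this sketch, and the CTC value is taken as already computed.

```python
def center_loss(features, labels, centers):
    """Sum over time steps of the squared distance between xt and the center of class yt."""
    total = 0.0
    for x_t, y_t in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(x_t, centers[y_t]))
    return total


def enhanced_ctc(ctc_value, features, labels, centers, lam=0.05):
    """CTC loss plus a weighted center-loss term (lam is an assumed weight)."""
    return ctc_value + lam * center_loss(features, labels, centers)


# Two time steps with 2-D features, both assigned to class 0.
centers = {0: [0.0, 0.0]}
features = [[1.0, 1.0], [0.0, 2.0]]
labels = [0, 0]
print(enhanced_ctc(1.0, features, labels, centers))
```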
The initialization of Enhanced CTC also has a large impact on the results. It mainly includes:
Train a network with the original CTCLoss;
Extract the set of correctly recognized images from the training set, denoted G;
Feed the images in G into the network one by one and extract the correspondence between xt and yt at each of the 80 timestamps, where yt is calculated as follows:
Gather the xt that correspond to the same yt and take their average as the initial center.
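The last two steps can be sketched as follows, assuming that yt is the index of the highest class score at time step t (the argmax of the network's per-step output). That assumption, and the names `argmax` and `init_centers`, are illustrative and not confirmed by the text.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def init_centers(features, scores):
    """Average the xt vectors that share the same predicted class yt.

    features: list of xt vectors, one per time step.
    scores:   list of per-class score lists, one per time step.
    """
    sums, counts = {}, {}
    for x_t, step_scores in zip(features, scores):
        y_t = argmax(step_scores)  # assumed definition of yt
        if y_t not in sums:
            sums[y_t] = [0.0] * len(x_t)
            counts[y_t] = 0
        sums[y_t] = [s + v for s, v in zip(sums[y_t], x_t)]
        counts[y_t] += 1
    return {y: [s / counts[y] for s in vec] for y, vec in sums.items()}


# Three time steps: the first two are predicted as class 0, the third as class 1.
features = [[1.0, 1.0], [3.0, 3.0], [0.0, 2.0]]
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(init_centers(features, scores))
```

Restricting the input to the correctly recognized set G keeps mislabeled time steps from contaminating the initial centers.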
After the three recognition optimizations above, the experimental results of the PP-OCRv2 recognition part are as follows:
As you can see, after this series of optimizations, the recognition model has grown in size, but its overall accuracy and speed have improved significantly.
With the optimizations in all five directions above, the final PP-OCRv2 surpasses PP-OCR across the board at the cost of a small increase in model size, as shown in the figure below:
PP series models in the PaddlePaddle industrial model library
Besides PP-OCRv2, to better meet the practical industrial needs of enterprise users, PaddlePaddle has released a series of PP models covering tasks such as item recognition, object detection, portrait segmentation, text recognition, image super-resolution and natural language processing. Space is limited, so they are summarized below; users who need them can learn more on GitHub:
(1) PP-ShiTu (upcoming release): an ultra-lightweight general image recognition system that effectively solves recognition problems in scenarios such as product recognition and vehicle recognition, and can be quickly migrated to other image recognition tasks.
(2) PP-YOLOv2: surpasses YOLOv5 in AP and speed on object detection; one version reaches 50.3% AP at 50.3 FPS.
(3) PP-HumanSeg: lightweight portrait segmentation with excellent accuracy and speed on mobile devices.
(4) PP-Structure: supports document layout analysis, structuring and table recognition; it is the richest document analysis solution in the industry.
(5) PP-VSR: a SOTA video super-resolution model with the best accuracy on the REDS and Vid4 datasets and impressive visual results.
The strong support behind quickly building star models:
A design philosophy oriented to the needs of industrial users, and PaddlePaddle's unified dynamic and static graph development experience
From years of industrial practice, PaddlePaddle has gained deep insight into what enterprises need from industrial models. The core is the following two points:
Good model results that meet business needs;
Convenient, fast and resource-saving inference deployment.
These two requirements map onto the two programming modes of a deep learning framework: dynamic graph programming (convenient debugging) and static graph deployment (good performance). By way of introduction, dynamic graph and static graph are the two common modes of deep learning frameworks.
In dynamic graph mode, code is written and run in the way Python programmers are used to, and it is easy to debug; in terms of performance, however, Python execution is costly and lags behind C++. Compared with dynamic graphs, static graphs have a performance advantage in deployment. A static graph program is compiled and then executed: the pre-built neural network can be freed from its Python dependency and parsed and executed on the C++ side, and with the whole network structure available, further graph-level optimizations can be applied.
PaddlePaddle's core framework was designed with these real user needs in mind and provides unified dynamic and static graph support: users write network code with dynamic graphs, and at inference deployment time a dynamic-to-static conversion interface analyzes the user code and automatically converts it into a static graph network structure, combining the ease of use of dynamic graphs with the deployment performance of static graphs.