Accuracy up 7%, speed up 220%: the open-source OCR toolkit PaddleOCR gets another upgrade

AI Vision Network 2021-10-14 04:57:47
  • Paper: https://arxiv.org/abs/2109.03144

  • Project: https://github.com/PaddlePaddle/PaddleOCR

PP-OCRv2 improves on three main fronts:

  • Accuracy: more than 7% higher than the PP-OCR mobile version;

  • Speed: more than 220% faster than the PP-OCR server version;

  • Model size: 11.6M in total, so it deploys easily on both servers and mobile devices.

 picture

To give readers more technical detail, the original PaddlePaddle PaddleOCR team has written an in-depth, exclusive interpretation of PP-OCRv2. We hope it helps with your work and study.

In-depth interpretation of PP-OCRv2's five key technical improvements

The newly upgraded PP-OCRv2 keeps the same overall pipeline as PP-OCR, as shown in the figure below:

 picture

The optimization strategy works in depth from five angles (marked by the red boxes in the figure above):

Two optimizations in the detection part:

  • Collaborative Mutual Learning (CML) knowledge distillation strategy

  • CopyPaste data augmentation strategy

Three optimizations in the recognition part:

  • LCNet (Lightweight CPU Network) lightweight backbone network

  • UDML knowledge distillation strategy

  • Enhanced CTC loss

Each is introduced in detail below.

Detection model optimization: CML knowledge distillation strategy

As shown in the figure below, standard distillation uses a large model as the Teacher to guide a Student model toward better results. The later DML mutual-learning distillation method has two models of the same structure learn from each other. Both of these methods involve only two models. PP-OCRv2 instead uses CML, a collaborative mutual distillation method across three models: it contains two Student models of the same structure that learn from each other, and it also introduces a Teacher model with a larger structure.

 picture

The core idea of CML therefore combines ① conventional distillation, in which the Teacher guides the Students, with ② DML mutual learning between the Student networks. The idea resembles a real school, where students are encouraged to discuss with one another while the teacher guides them, and it helps the Students learn better.

The specific network structure and the three carefully designed key loss functions are shown in the figure below:

 picture

DML Loss: An input training image is fed into both Student networks (the DBNet detection model is used here), and each outputs its probability map (response map). The DML loss between the two networks is then computed using KL divergence. The corresponding formula is given below, where S1 and S2 denote the two Student networks and KL is the KL divergence:

 picture
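Since the formula image is not reproduced here, the following is only a minimal sketch of a symmetric KL-based mutual-learning loss between two probability maps. It assumes the loss averages KL(S1 || S2) and KL(S2 || S1) and treats each pixel as a Bernoulli (text / non-text) distribution; the function names and the sample maps are hypothetical, not PaddleOCR's API.

```python
import math

def bernoulli_kl(p, q, eps=1e-8):
    """KL(Bern(p) || Bern(q)) for two foreground probabilities in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def map_kl(map_p, map_q):
    """Mean per-pixel KL divergence between two probability maps (lists of rows)."""
    pairs = [(p, q) for row_p, row_q in zip(map_p, map_q)
             for p, q in zip(row_p, row_q)]
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(pairs)

def dml_loss(s1_map, s2_map):
    """Assumed symmetric form: average of KL(S1 || S2) and KL(S2 || S1)."""
    return 0.5 * (map_kl(s1_map, s2_map) + map_kl(s2_map, s1_map))

# Illustrative 2x2 probability maps from two Student networks (made-up values).
s1 = [[0.9, 0.2], [0.1, 0.8]]
s2 = [[0.8, 0.3], [0.2, 0.7]]
print(dml_loss(s1, s2))
```

The loss is zero when both Students output identical maps and grows as their predictions diverge, which is what pushes the two networks toward agreement.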

GT Loss: The standard DBNet training task is shown in the figure below:

 picture

Its output mainly consists of the three feature maps above, detailed in the table below:

 picture

The GT Loss can be expressed as:

 picture

Distill Loss: The third part is the supervision signal from the Teacher. Distillation is applied only to the probability map and binary map features. In addition, practical experience shows that dilating the Teacher's output (f_dila()) strengthens the Teacher's representation ability and raises the Teacher's accuracy by about 2%, which in turn improves the distillation effect. The corresponding function can be expressed as:

 picture

Here the weighting hyperparameter is set to 5 by default, the two loss terms are the cross-entropy loss and the Dice loss, and f_dila is the dilation function.

Finally, all the loss functions are added together to give the final loss function, as shown below:
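The formula image is missing, so the sketch below only illustrates the ingredients named in the text: a dilation of the Teacher output, a cross-entropy term, a Dice term and a weight of 5. It assumes the weight applies to the cross-entropy term on the probability map and that the dilation uses a 2x2 window anchored at each pixel; both choices, and all function names, are assumptions rather than the confirmed PaddleOCR implementation.

```python
import math

def dilate(prob_map, k=2):
    """Grayscale dilation: max over a k x k window anchored at each pixel
    (the window is clipped at the image border)."""
    h, w = len(prob_map), len(prob_map[0])
    return [[max(prob_map[yy][xx]
                 for yy in range(y, min(y + k, h))
                 for xx in range(x, min(x + k, w)))
             for x in range(w)] for y in range(h)]

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two maps with values in [0, 1]."""
    total, n = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            n += 1
    return total / n

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient between two maps."""
    inter = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(p for rp in pred for p in rp) + sum(t for rt in target for t in rt)
    return 1.0 - 2.0 * inter / (total + eps)

def distill_loss(s_prob, s_binary, t_prob, t_binary, gamma=5.0):
    """Assumed form: gamma * BCE(probability maps) + Dice(binary maps),
    both computed against the dilated Teacher output."""
    return (gamma * bce_loss(s_prob, dilate(t_prob))
            + dice_loss(s_binary, dilate(t_binary)))

# Made-up 2x2 maps, only to show the call.
teacher = [[0.0, 1.0], [0.0, 0.0]]
student = [[0.6, 0.9], [0.1, 0.1]]
print(distill_loss(student, student, teacher, teacher))
```

Dilation widens the Teacher's text regions slightly, so pixels next to a confident text pixel also receive a strong target.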

 picture

The weights of the three loss functions can also be tuned as hyperparameters or learned; interested developers are welcome to keep experimenting.

Detection model optimization: CopyPaste data augmentation strategy

Training a detection model in practice always runs into two problems: ① insufficient sample richness, mainly because labeling large amounts of data is expensive and because both the collection process and the collected data need to be diverse; ② poor robustness to the environment, where the same text distribution yields quite different detection results against different backgrounds.

Data augmentation is therefore a natural and important way to improve the model's generalization ability. CopyPaste is a novel data augmentation technique whose effectiveness has been verified on object detection and instance segmentation tasks. With CopyPaste, text instances can be synthesized to balance the ratio of positive to negative samples in the training images.

 picture

By contrast, traditional image rotation, random flipping and random cropping cannot do this. The main steps of CopyPaste are:

  1. Randomly select two training images;

  2. Apply random scale jittering;

  3. Apply a random horizontal flip;

  4. Randomly select a subset of the objects in one image;

  5. Paste them at a random position in the other image.
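The steps above can be sketched in plain Python as follows. This is a simplified illustration, not PaddleOCR's implementation: images are 2-D lists of pixel values, text instances are axis-aligned `(x, y, w, h)` boxes, scale jitter and flipping are applied to each copied patch rather than to the whole image, and overlap with existing text is not checked. The function names and the 0.5 to 1.5 scale range are assumptions.

```python
import random

def resize_nearest(img, scale):
    """Nearest-neighbour resize of a 2-D list image by a scale factor."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    return [[img[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
             for x in range(nw)] for y in range(nh)]

def copy_paste(src_img, src_boxes, dst_img, dst_boxes, rng):
    """Paste a random subset of text boxes from src_img into a copy of dst_img.

    Boxes are (x, y, w, h). Returns the new image and its updated box list.
    """
    out = [row[:] for row in dst_img]          # leave dst_img untouched
    boxes = list(dst_boxes)
    dst_h, dst_w = len(out), len(out[0])
    chosen = rng.sample(src_boxes, rng.randint(1, len(src_boxes)))  # step 4
    for (x, y, w, h) in chosen:
        patch = [row[x:x + w] for row in src_img[y:y + h]]
        patch = resize_nearest(patch, rng.uniform(0.5, 1.5))        # step 2
        if rng.random() < 0.5:                                      # step 3
            patch = [row[::-1] for row in patch]
        ph, pw = len(patch), len(patch[0])
        if ph > dst_h or pw > dst_w:
            continue                           # patch does not fit, skip it
        px = rng.randint(0, dst_w - pw)                             # step 5
        py = rng.randint(0, dst_h - ph)
        for dy in range(ph):
            out[py + dy][px:px + pw] = patch[dy]
        boxes.append((px, py, pw, ph))
    return out, boxes

# Step 1 (choosing the two images) is done by the caller; toy data here.
rng = random.Random(0)
src = [[1] * 8 for _ in range(8)]      # "text" image, all ones
dst = [[0] * 16 for _ in range(16)]    # background image, all zeros
new_img, new_boxes = copy_paste(src, [(0, 0, 4, 2)], dst, [], rng)
print(new_boxes)
```

Returning the updated box list matters as much as the pixels: the pasted instances become extra ground-truth labels for the detector.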

This improves sample richness and also makes the model more robust to the environment. As shown in the figure below, the text cut out of the second image is randomly rotated, scaled and pasted into the first image, further enriching the diversity of text against different backgrounds.

 picture

With the two detection optimizations above, the experimental results for the detection part of PP-OCRv2 are shown in the table below:

 picture

As the table shows, distillation and data augmentation do not change the model structure, so the overall model size and prediction speed are unchanged, while accuracy measured by Hmean rises by 3.6%, a clear improvement.

Recognition model optimization: the self-developed LCNet lightweight backbone network
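For readers unfamiliar with the metric, Hmean is the harmonic mean of precision and recall (the F1 score). A minimal sketch follows; the input values are made up for illustration and are not taken from the article's table.

```python
def hmean(precision, recall):
    """Harmonic mean (F1 score) of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2.0 * precision * recall / (precision + recall)

# Illustrative values only.
print(hmean(0.8, 0.7))
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot score well on Hmean by maximizing only precision or only recall.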

 picture

Here the PP-OCRv2 R&D team proposed LCNet, a new backbone network improved from MobileNetV1. The major changes include:

  • Except in the SE module, every ReLU in the network is replaced with h-swish, improving accuracy by 1%-2%;

  • In the fifth stage of LCNet, the kernel size of the DW (depthwise) convolution is changed to 5x5, improving accuracy by 0.5%-1%;

  • SE modules are added to the last two DepthSepConv blocks of LCNet's fifth stage, improving accuracy by 0.5%-1%;

  • A 1280-dimensional FC layer is added after GAP to strengthen feature representation, improving accuracy by 2%-3%.

Validated on the ImageNet-1k dataset, LCNet is not only far ahead of other lightweight backbone networks in accuracy but also has a clear advantage in CPU prediction speed.
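As a reminder of what the first change swaps in, h-swish (hard swish) is commonly defined as x * ReLU6(x + 3) / 6. The scalar sketch below shows only that activation; it is not LCNet code.

```python
def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(v, h_swish(v))
```

Unlike ReLU, h-swish lets small negative inputs pass through with a small negative output, while staying cheap to compute because it avoids the exponential used by the original swish.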

 picture

Recognition model optimization: UDML knowledge distillation strategy

 picture

Building on standard DML knowledge distillation, UDML newly introduces supervision on the feature map and adds a Feature Loss. This is combined with CRNN's basic CTC Loss and the distillation DML Loss; each is computed separately and then summed. The total loss function can be expressed as:

 picture

In addition, tricks such as increasing the number of training iterations and adding an FC network to the Head balance the model's feature encoding and decoding abilities and further improve the model's effect.

Recognition model optimization: Enhanced CTC loss

A common difficulty in Chinese OCR tasks is that there are too many similar-looking characters, which are easily misrecognized. Borrowing the idea of metric learning, Center Loss is therefore introduced to further increase the distance between classes. The core idea is shown in the formula below:

 picture

Meanwhile, the initialization process for Enhanced CTC also has a great impact on the results. It mainly includes:

  • Train a network with the original CTC loss;

  • Extract the set of correctly recognized images from the training set, denoted G;

  • Feed the images in G into the network one by one and extract the correspondence between x_t and y_t at the 80 time steps, where y_t is calculated as follows:

 picture

  • Gather the x_t that correspond to the same y_t and take their average as the initial center.

With the three recognition optimizations above, the experimental results for the recognition part of PP-OCRv2 are as follows:
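The last two initialization steps can be sketched as below. Because the formula image for y_t is missing, the sketch assumes y_t is the argmax of the network's per-class scores at time step t; the function names and the toy features are hypothetical.

```python
from collections import defaultdict

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def init_centers(features, scores):
    """Group feature vectors x_t by predicted class y_t and average them.

    features: list of x_t vectors (one per time step, across all images in G)
    scores:   matching list of per-class score vectors
    Returns {class index: mean feature vector}, used as the initial centers.
    """
    groups = defaultdict(list)
    for x_t, class_scores in zip(features, scores):
        groups[argmax(class_scores)].append(x_t)   # assumed y_t = argmax
    return {label: [sum(col) / len(xs) for col in zip(*xs)]
            for label, xs in groups.items()}

# Toy data: three time steps, 2-D features, two classes.
feats = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
centers = init_centers(feats, probs)
print(centers)
```

Starting each center at the mean feature of its class gives Center Loss a meaningful target from the first training step instead of a random one.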

 picture

As the results show, after this series of optimizations the recognition model has grown in size, but overall accuracy and speed have improved significantly.

Moreover, after optimization in the five directions above, the final PP-OCRv2 surpasses PP-OCR across the board at the cost of only a small increase in model size, as shown in the figure below:

 picture

PP series models in the PaddlePaddle industrial model library

Besides PP-OCRv2, to better meet the real industrial needs of enterprise users, PaddlePaddle has released a PP series of models covering tasks such as item recognition, object detection, portrait segmentation, text recognition, image super-resolution and natural language processing. Space is limited, so they are summarized below; interested users can learn more on GitHub:

(1) PP-ShiTu (upcoming release): an ultra-lightweight general image recognition system that effectively solves recognition problems in scenarios such as product recognition and vehicle recognition, and can be quickly transferred to other image recognition tasks.

 picture

(2) PP-YOLOv2: AP and speed on the object detection task surpass YOLOv5; one version reaches 50.3% AP at 50.3 FPS.

 picture

(3) PP-HumanSeg: lightweight portrait segmentation with excellent accuracy and speed on mobile devices.

 picture

(4) PP-Structure: supports document layout analysis, structuring and table recognition; it is the richest document analysis solution in the industry.

 picture

(5) PP-VSR: a SOTA video super-resolution model with the best accuracy on the REDS and Vid4 datasets and striking visual results.

 picture

The strong support behind quickly building star models:

A design philosophy oriented to industrial users' needs, and PaddlePaddle's unified dynamic-static development experience

From years of industrial practice, PaddlePaddle has gained deep insight into what enterprises need from industrial models. The core comes down to two points:

  • Good model effect that meets business needs;

  • Inference deployment that is convenient, fast and resource-efficient.

These two requirements map onto the programming modes of a deep learning framework: dynamic graph programming (easy to debug) and static graph deployment (good performance). A brief introduction: dynamic graph and static graph are the two common modes of deep learning frameworks.

In dynamic graph mode, code is written and run in line with Python programmers' habits and is easy to debug, but Python execution is costly and lags somewhat behind C++ in performance. Compared with dynamic graphs, static graphs have a performance advantage in deployment. A static graph program is compiled before it is executed: the pre-built neural network can shed its Python dependency and be parsed and executed on the C++ side, and because the whole network structure is available, the network structure can also be optimized.

 picture

In designing its core framework, PaddlePaddle keenly grasped these real user needs and provides unified dynamic-static capability, letting users write networking code with dynamic graphs. At deployment time, the dynamic-to-static conversion interface analyzes the user code and automatically converts it into a static graph network structure, combining the ease of use of dynamic graphs with the deployment performance of static graphs.
