CBNetV2: the composite backbone network proposed by Peking University, reaching 60.1% box AP on COCO

AI Vision Network 2021-10-14 04:54:44

CBNetV2: A Composite Backbone Network Architecture for Object Detection


Paper: 2107.00420.pdf


Modern top-performing object detectors depend heavily on backbone networks, whose progress has brought consistent performance gains through the exploration of more effective network structures. In this paper, we propose a novel and flexible backbone framework, CBNetV2, for building high-performance detectors from existing open-source pre-trained backbones under the pre-training fine-tuning paradigm. In particular, the CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high-level and low-level features of multiple backbone networks and gradually expands the receptive field to detect objects more efficiently. We also propose a better training strategy with auxiliary supervision for CBNet-based detectors. CBNetV2 generalizes well across different backbones and head designs of the detector architecture. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to a variety of backbones (CNN-based and Transformer-based) and to the head designs of most mainstream detectors (one-stage and two-stage, anchor-based and anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 offers a more efficient, effective and resource-friendly way to build high-performance backbone networks. In particular, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model, single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP, achieved by Swin-L), while the training schedule is reduced by 6×. With multi-scale testing, and without extra training data, we push the best single-model result to a new record of 60.1% box AP and 52.3% mask AP.
The code is publicly available.

1. Introduction

Object detection is one of the fundamental problems in computer vision, serving applications such as autonomous driving, intelligent video surveillance and remote sensing. In recent years, object detection has made great progress thanks to the rapid development of deep convolutional networks [2], and excellent detectors have been proposed, for example SSD [3], YOLO [4], Faster R-CNN [5], RetinaNet [6], ATSS [7], Mask R-CNN [8] and Cascade R-CNN [9].

Typically, in a neural network (NN) based detector, the backbone network extracts the basic features used to detect objects; it is usually designed for image classification originally and pre-trained on the ImageNet dataset [10]. Intuitively, the more representative the features extracted by the backbone, the better its host detector performs. To obtain higher accuracy, mainstream detectors have adopted deeper and wider backbones, moving from mobile-size models [11], [12] and ResNet [13] to ResNeXt [14] and Res2Net [15]. Recently, Transformer-based [16] backbones have also been explored and have shown very promising performance. Overall, progress in pre-training large backbones shows a trend towards more effective and efficient multi-scale representations in object detection.

Encouraged by the results of detectors based on large pre-trained backbones, we seek further improvement by building high-performance detectors from existing well-designed backbone architectures and their pre-trained weights. Although one could design a new, improved backbone, the cost in expertise and computing resources can be high. On the one hand, designing a new backbone architecture requires expert experience and a lot of trial and error. On the other hand, pre-training a new backbone on ImageNet (especially a large model) requires a large amount of computing resources, which makes it expensive to obtain better detection performance under the pre-training fine-tuning paradigm. Alternatively, training detectors from scratch saves the cost of pre-training, but requires even more computing resources and training skills [17].



In this paper, we propose a simple and novel composition approach that uses existing pre-trained backbones under the pre-training fine-tuning paradigm. Unlike most previous methods, which focus on modular crafting and require pre-training on ImageNet to strengthen the representation, we improve the representation ability of existing backbones without additional pre-training. As shown in Figure 1, our solution, named Composite Backbone Network V2 (CBNetV2), groups multiple identical backbones together. Specifically, the parallel backbones (called the assisting backbones and the lead backbone) are connected via composite connections. From left to right in Figure 1, the output of each stage in an assisting backbone flows to the parallel and lower-level stages of its succeeding sibling. Finally, the features of the lead backbone are fed to the neck and the detection head for bounding box regression and classification. In contrast to simply deepening or widening a network, CBNetV2 integrates the high-level and low-level features of multiple backbone networks and gradually expands the receptive field for more efficient object detection. Notably, each assembled backbone of CBNetV2 is initialized with the weights of an existing open-source pre-trained single backbone (for example, Dual-ResNet50 is initialized with the weights of ResNet50 [13], which are available in the open-source community). In addition, to further exploit the potential of CBNetV2, we propose an effective training strategy with supervision for the assisting backbones, which achieves higher detection accuracy than the original CBNet [1] without sacrificing inference speed.

We demonstrate the effectiveness of our framework through experiments on the challenging MS COCO benchmark [18]. The experiments show that CBNetV2 generalizes well across different backbones and head designs of the detector architecture, which lets us train detectors that significantly outperform those based on larger backbones. Specifically, CBNetV2 can be applied to various backbones, from convolution-based [13], [14], [15] to Transformer-based [19]. Compared with the original backbones, our Dual-Backbone improves their performance by 3.4%∼3.5% AP, demonstrating the effectiveness of CBNetV2. Under comparable model complexity, our Dual-Backbone still gains 1.1%∼2.1% AP, which indicates that the composite backbone is more effective than pre-trained wider and deeper networks. Moreover, CBNetV2 can be flexibly plugged into mainstream detectors (for example, RetinaNet [6], ATSS [7], Faster R-CNN [5], Mask R-CNN [8], Cascade R-CNN and Cascade Mask R-CNN [9]) and consistently improves their performance by 3%∼3.8% AP, demonstrating its strong adaptability to various detector head designs. Notably, CBNetV2 provides a general and resource-friendly framework for pushing the accuracy ceiling of high-performance detectors. Without bells and whistles, our Dual-Swin-L achieves an unparalleled single-model, single-scale result of 59.4% box AP and 51.6% mask AP on COCO test-dev, surpassing the state-of-the-art result (57.7% box AP and 50.2% mask AP, obtained by Swin-L) while reducing the training schedule by 6×. With multi-scale testing, we push the best single-model result to a new record of 60.1% box AP and 52.3% mask AP.

The main contributions of this paper are as follows:

• We propose a general, efficient and effective framework, CBNetV2 (Composite Backbone Network V2), for building high-performance backbone networks for object detection without additional pre-training.

• We propose the Dense Higher-Level Composition (DHLC) style and auxiliary supervision to make more effective use of existing pre-trained weights for object detection under the pre-training fine-tuning paradigm.

• Our Dual-Swin-L sets a new record for single-model, single-scale results on COCO with a training schedule 6× shorter than that of Swin-L. With multi-scale testing, our method achieves the best known result without extra training data.

2. Related work

**Object detection.** Object detection aims to locate every object instance from a predefined set of classes in an input image. With the rapid development of convolutional neural networks (CNNs), a popular paradigm has emerged for deep-learning-based object detectors: the backbone network (usually designed for classification and pre-trained on ImageNet) extracts basic features from the input image, the neck (for example, a feature pyramid network [21]) then enhances the multi-scale features from the backbone, and finally the detection head predicts object bounding boxes with position and classification information. Based on the detection head, cutting-edge methods for generic object detection can be briefly divided into two major branches. The first branch contains one-stage detectors such as YOLO [4], SSD [3], RetinaNet [6], NAS-FPN [22] and EfficientDet [23]. The other branch contains two-stage methods such as Faster R-CNN [5], FPN [21], Mask R-CNN [8], Cascade R-CNN [9] and Libra R-CNN [24]. Recently, academic attention has turned to anchor-free detectors, partly because of the emergence of FPN [21] and focal loss [6], and more elegant end-to-end detectors have been proposed. On the one hand, FSAF [25], FCOS [26], ATSS [7] and GFL [27] improve RetinaNet with center-based anchor-free methods. On the other hand, CornerNet [28] and CenterNet [29] detect object bounding boxes with keypoint-based methods.

Recently, neural architecture search (NAS) has been applied to automatically search for detector-specific architectures. NAS-FPN [22], NAS-FCOS [30] and SpineNet [31] use reinforcement learning to control architecture sampling and obtain promising results. SM-NAS [32] uses an evolutionary algorithm and a partial-order pruning method to search for the optimal combination of the different parts of the detector. Auto-FPN [33] uses a gradient-based method to search for the best detector. DetNAS [34] and OPANAS [35] use one-shot methods to search for efficient backbones and necks for object detection.

**Backbones for object detection.** Starting from AlexNet [2], mainstream detectors have exploited deeper and wider backbone networks such as VGG [37], ResNet [13], DenseNet [38], ResNeXt [14] and Res2Net [15]. Since backbone networks are usually designed for classification, both pre-training on ImageNet followed by fine-tuning on a given detection dataset and training from scratch on the detection dataset require a large amount of computing resources and are difficult to optimize. Recently, two non-trivially designed backbones, DetNet [39] and FishNet [40], were designed specifically for the detection task. However, they still need to be pre-trained on a classification task before being fine-tuned for detection. Res2Net [15] achieves impressive results in object detection by representing multi-scale features at a granular level and increasing the range of receptive fields of each network layer. Besides manually designed backbone architectures, DetNAS [34] uses NAS to search for a better backbone for object detection, which reduces the cost of manual design. Swin Transformer [19] builds its backbone from Transformer modules and achieves impressive results, although it requires expensive pre-training.

As is well known, designing and pre-training a new, robust backbone requires a large computational cost. As an alternative, we propose a more economical and efficient solution: building a more powerful object detection backbone by assembling multiple identical existing backbones (for example, ResNet [13], ResNeXt [14], Res2Net [15] and Swin Transformer [19]).



**Recurrent convolutional neural network.** Unlike the feed-forward architecture of a CNN, the Recurrent CNN (RCNN) [20] incorporates recurrent connections into each convolutional layer. This property enhances the model's ability to integrate context information, which is important for object recognition. As shown in Figure 3, our proposed composite backbone network shares some similarities with the unfolded RCNN [20], but the two are very different. First, as shown in Figure 3, the connections between parallel stages in CBNet are unidirectional, whereas they are bidirectional in RCNN. Second, in RCNN the parallel stages at different time steps share parameter weights, while in the proposed CBNet the parallel stages of the backbones are independent of each other. Moreover, RCNN would need to be pre-trained on ImageNet before it could serve as the backbone of a detector. By contrast, CBNet needs no additional pre-training because it directly uses existing pre-trained weights.

3. Proposed approach

This section describes the proposed CBNetV2 in detail. Sections 3.1 and 3.2 describe its basic architecture and its variants, respectively. Section 3.3 proposes a training strategy for CBNet-based detectors. Section 3.4 briefly introduces the pruning strategy. Section 3.5 summarizes the detection framework of CBNetV2.

3.1 Architecture of CBNetV2

The proposed CBNetV2 consists of K identical backbones (K ≥ 2). In particular, we call the K = 2 case (shown in Figure 3.a) Dual-Backbone (DB) and the K = 3 case Triple-Backbone (TB).

As shown in Figure 1, the CBNetV2 architecture contains two types of backbones: the lead backbone $B_K$ and the assisting backbones $B_1, B_2, \dots, B_{K-1}$. Each backbone comprises L stages (usually L = 5), and each stage consists of several convolutional layers with feature maps of the same size. The l-th stage of a backbone implements a non-linear transformation $F^l(\cdot)$.

Most conventional convolutional networks follow the design of encoding the input image into intermediate features with monotonically decreasing resolution. Specifically, the l-th stage takes the output of the previous, (l−1)-th stage (denoted $x^{l-1}$) as input, which can be expressed as:

$$x^l = F^l(x^{l-1}), \quad l \ge 2. \tag{1}$$

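In contrast, we use the assisting backbones $B_1, B_2, \dots, B_{K-1}$ to improve the representation ability of the lead backbone $B_K$. We feed the features of each backbone to its successor iteratively, in a stage-by-stage fashion. Equation (1) can therefore be rewritten as:

$$x_k^l = F_k^l\left(x_k^{l-1} + g^{l-1}(\mathbf{x}_{k-1})\right), \quad l \ge 2, \; k = 2, 3, \dots, K, \tag{2}$$

where $g^{l-1}(\cdot)$ denotes the composite connection, which takes the stage features $\mathbf{x}_{k-1}$ of the preceding backbone as input and outputs a feature of the same size as $x_k^{l-1}$.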

For the object detection task, only the output features of the lead backbone, $\{x_K^l\}$, are fed into the neck and then the RPN/detection head, while the outputs of each assisting backbone are forwarded to its succeeding sibling. Notably, $B_1, B_2, \dots, B_K$ can adopt various backbone architectures (for example, ResNet [13], ResNeXt [14], Res2Net [15] and Swin Transformer [19]) and are initialized directly from the pre-trained weights of a single backbone.
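The stage-by-stage data flow described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: features are plain floats rather than feature maps, and the names `composite_forward`, `Stage` and `Composite` are hypothetical. The composite connection is passed in as a function, so any of the styles in Section 3.2 could be plugged in.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for this sketch: a stage maps a feature to a
# feature, and a composite connection builds one feature for a given level
# from all stage outputs of the previous backbone.
Stage = Callable[[float], float]
Composite = Callable[[Sequence[float], int], float]


def composite_forward(
    image: float,
    backbones: Sequence[Sequence[Stage]],
    composite: Composite,
) -> List[float]:
    """Run K backbones in sequence and return the lead backbone's outputs.

    backbones[k][l] is stage l+1 of backbone k+1 (0-based lists). The last
    backbone in the list plays the role of the lead backbone.
    """
    prev_outputs: List[float] = []
    for k, stages in enumerate(backbones):
        x = stages[0](image)  # the first stage always reads the input image
        outputs = [x]
        for l in range(1, len(stages)):
            if k > 0:
                # Add the composite feature from the previous backbone to
                # the input of this stage (same level as that input).
                x = x + composite(prev_outputs, l - 1)
            x = stages[l](x)
            outputs.append(x)
        prev_outputs = outputs
    return prev_outputs
```

With a toy stage that adds 1 and a same-level composite connection (`lambda feats, level: feats[level]`), a single three-stage backbone yields `[1.0, 2.0, 3.0]`, while the dual-backbone version yields `[1.0, 3.0, 6.0]`, showing how the assisting backbone's features accumulate in the lead backbone.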

3.2 Possible composite styles

A composite connection $g^l(\mathbf{x})$ takes the features $\mathbf{x} = \{x^i \mid i = 1, 2, \dots, L\}$ of an assisting backbone as input and outputs a feature of the same size as $x^l$ (the backbone index k is omitted for simplicity). We propose the following five composite styles.

3.2.1 Same-Level Composition (SLC)


An intuitive and simple way of compositing is to fuse the output features from the same stage of the backbones. As shown in Figure 2.a, the SLC operation can be formulated as:

$$g^l(\mathbf{x}) = w(x^l), \quad l \ge 2, \tag{3}$$

where w denotes a 1 × 1 convolutional layer followed by a batch normalization layer.

3.2.2 Adjacent Higher-Level Composition (AHLC)

Motivated by the Feature Pyramid Network [21], in which the top-down pathway introduces spatially coarser but semantically stronger higher-level features to enhance the lower-level features of the bottom-up pathway, our previous CBNet [1] used Adjacent Higher-Level Composition (AHLC): the output of the adjacent higher-level stage of the previous backbone is fed to the subsequent backbone (from left to right in Figure 2.b):

$$g^l(\mathbf{x}) = U\left(w(x^{l+1})\right), \quad l \ge 1, \tag{4}$$

where $U(\cdot)$ denotes an upsampling operation.

3.2.3 Adjacent Lower-Level Composition (ALLC)

In contrast to AHLC, we introduce a bottom-up pathway that feeds the output of the adjacent lower-level stage of the previous backbone to the subsequent one. This Adjacent Lower-Level Composition (ALLC) operation is shown in Figure 2.c and is formulated as:

$$g^l(\mathbf{x}) = D\left(w(x^{l-1})\right), \quad l \ge 2, \tag{5}$$

where $D(\cdot)$ denotes a downsampling operation.

3.2.4 Dense Higher-Level Composition (DHLC)

In DenseNet [38], each layer is connected to all subsequent layers to build comprehensive features. Inspired by this, we use dense composite connections in the CBNet architecture. The DHLC operation is expressed as:

$$g^l(\mathbf{x}) = \sum_{i=l+1}^{L} U\left(w_i(x^i)\right), \quad l \ge 1. \tag{6}$$

As shown in Figure 2.d, when K = 2 we composite the features from all higher-level stages of the previous backbone and add them to the lower-level stages of the latter one.

3.2.5 Full-Connected Composition (FCC)

As shown in Figure 2.e, we composite the features from all stages of the previous backbone and feed them to every stage of the following one. Compared with DHLC, we add connections in the low-to-high-level case. The FCC operation can be expressed as:

$$g^l(\mathbf{x}) = \sum_{i=2}^{L} I\left(w_i(x^i)\right), \quad l \ge 1, \tag{7}$$

where $I(\cdot)$ denotes scale resizing: $I(\cdot) = D(\cdot)$ when $i < l$, and $I(\cdot) = U(\cdot)$ when $i > l$.
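To make the difference between the five composite styles concrete, the following sketch lists which stages of the assisting backbone contribute to the feature produced for a given level. It is a minimal illustration rather than the authors' code: the function name `source_stages` is hypothetical, stages are numbered 1..L as in the text, and FCC is assumed to draw on stages 2..L (the 1 × 1 convolution, batch normalization and resizing steps are left out).

```python
from typing import List


def source_stages(style: str, level: int, num_stages: int = 5) -> List[int]:
    """Return the assisting-backbone stages feeding the feature at `level`.

    Stages are numbered 1..num_stages. Indices that fall outside this range
    are dropped, so for example AHLC has no source for the top level.
    """
    if style == "SLC":      # same level only
        picked = [level]
    elif style == "AHLC":   # the adjacent higher level
        picked = [level + 1]
    elif style == "ALLC":   # the adjacent lower level
        picked = [level - 1]
    elif style == "DHLC":   # every higher level
        picked = list(range(level + 1, num_stages + 1))
    elif style == "FCC":    # assumption: every stage from stage 2 upwards
        picked = list(range(2, num_stages + 1))
    else:
        raise ValueError(f"unknown composite style: {style}")
    return [i for i in picked if 1 <= i <= num_stages]
```

For example, `source_stages("DHLC", 2)` gives `[3, 4, 5]`, whereas `source_stages("AHLC", 2)` gives only `[3]`; the denser set of sources is what distinguishes DHLC from the AHLC style used in the original CBNet.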

3.3 Auxiliary supervision



Although increasing depth usually leads to better performance [13], it may introduce additional optimization difficulties, as in the case of image classification [41]. The works [42], [43] introduce auxiliary classifiers at intermediate layers to improve the convergence of very deep networks. In the original CBNet, although the composite backbones are parallel, the latter backbone (for example, the lead backbone in Figure 4.a) deepens the network through the adjacent connections from the previous backbone (for example, the assisting backbone in Figure 4.a). To train CBNet-based detectors better, we propose to generate initial results from the assisting backbones under the supervision of an auxiliary neck and detection head, which provides additional regularization.

An example of our supervised CBNetV2 with K = 2 is shown in Figure 4.b. Apart from the original loss, which trains detection head 1 on the lead backbone features, another detection head 2 takes the assisting backbone features as input and produces the auxiliary supervision. Note that detection head 1 and detection head 2 share weights, as do the two necks. The auxiliary supervision helps to optimize the learning process, while the original loss for the lead backbone takes the most responsibility. We add weights to balance the auxiliary supervision, and the total loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\text{Lead}} + \sum_{i=1}^{K-1} \lambda_i \cdot \mathcal{L}_{\text{Assist}}^{i}, \tag{8}$$

where $\mathcal{L}_{\text{Lead}}$ is the loss of the lead backbone, $\mathcal{L}_{\text{Assist}}^{i}$ is the loss of the i-th assisting backbone, and $\lambda_i$ is the loss weight of the i-th assisting backbone.

During inference, we discard the auxiliary supervision branch and use only the output features of the lead backbone in CBNetV2 (Figure 4.b). Consequently, auxiliary supervision does not affect inference speed.
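The weighted combination of the lead loss and the auxiliary losses is simple enough to write down directly. The sketch below is a plain-Python illustration with scalar losses; the function name `total_loss` and its argument names are assumptions made for this example, and in a real detector each loss would itself be a sum of classification and regression terms.

```python
from typing import Sequence


def total_loss(
    lead_loss: float,
    assist_losses: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Lead-backbone loss plus the weighted losses of the assisting backbones."""
    if len(assist_losses) != len(weights):
        raise ValueError("one weight per assisting backbone is required")
    return lead_loss + sum(w * loss for w, loss in zip(weights, assist_losses))
```

With one assisting backbone, `total_loss(1.0, [2.0], [0.5])` evaluates to `2.0`. Setting every weight to zero recovers training without auxiliary supervision, and because the auxiliary branch is dropped at inference time, the weights influence training only.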

3.4 Pruning strategy for CBNetV2



To reduce the model complexity of CBNetV2, we explore pruning different numbers of stages in the 2nd, 3rd, ..., K-th backbones instead of compositing the backbones as a whole (that is, adding complete identical backbones to the original one). For simplicity, Figure 5 shows the K = 2 case. There are five ways to prune the backbone: $s_i$ indicates that i stages remain in the pruned backbone, and the pruned stages are filled with the features of the same stages of the first backbone. Details can be found in Section 4.4.4.
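The bookkeeping behind a pruned backbone can be sketched as a mapping from each stage to the place its feature comes from. This is a hedged illustration only: the function name `stage_sources` is hypothetical, and it assumes that a variant with i remaining stages keeps the last (highest-level) i stages, which the text above does not spell out.

```python
from typing import Dict


def stage_sources(kept: int, num_stages: int = 5) -> Dict[int, str]:
    """Map each stage (1..num_stages) of a pruned backbone to its feature source.

    Assumption for this sketch: the `kept` highest-level stages are computed
    by the pruned backbone itself; all earlier stages reuse the feature of
    the same stage in the first backbone.
    """
    if not 0 <= kept <= num_stages:
        raise ValueError("kept must be between 0 and num_stages")
    first_kept = num_stages - kept + 1
    return {
        stage: "own" if stage >= first_kept else "first backbone"
        for stage in range(1, num_stages + 1)
    }
```

Under this assumption, `stage_sources(3)` marks stages 1 and 2 as reused from the first backbone and stages 3 to 5 as computed by the pruned backbone, so fewer kept stages translate directly into fewer FLOPs.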

3.5 Architecture of the detection network with CBNetV2

CBNetV2 can be applied to a variety of off-the-shelf detectors without further modifications to the network architecture. In practice, we attach the lead backbone to the functional networks, for example FPN [21] and the detection head. The inference phase of CBNetV2 for object detection is shown in Figure 1.

4. Experiments

In this section, we evaluate the proposed method through extensive experiments. Section 4.1 details the experimental setup. Section 4.2 compares our method with state-of-the-art detection methods. Section 4.3 demonstrates the generality of our method through experiments on different backbones and detectors. Section 4.4 presents extensive ablation studies and analyses of the individual components of our framework. Finally, Section 4.5 shows some qualitative results of the proposed method.

4.1 Implementation details

4.1.1 Data sets and evaluation criteria

We conduct experiments on the COCO [18] benchmark. Training is performed on the 118k training images, and ablation studies are performed on the 5k minival images. We also report results on the 20k test-dev images for comparison with state-of-the-art (SOTA) methods. For evaluation, we adopt the metrics of the COCO detection evaluation criteria, including the average precision (AP) averaged over IoU thresholds ranging from 0.5 to 0.95 at different scales.
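As a small illustration of what the IoU threshold range means, the sketch below computes the intersection over union of two boxes and lists the ten COCO thresholds (0.5 to 0.95 in steps of 0.05) at which a prediction would count as a match. It is not the COCO evaluation code: the helper names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the full metric additionally involves ranking detections by score and averaging precision over recall levels and classes.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds() -> List[float]:
    """The ten thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred: Box, gt: Box) -> List[float]:
    """Thresholds at which `pred` overlaps `gt` enough to count as a match."""
    overlap = iou(pred, gt)
    return [t for t in coco_iou_thresholds() if overlap >= t]
```

A prediction with an IoU of about 0.72 against its ground-truth box matches at the five thresholds from 0.5 to 0.7 and fails at the stricter ones, which is why better localization raises the averaged AP.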

4.1.2 Training and inference details

Our experiments are based on the open-source detection toolbox MMDetection [48]. For ablation studies and simple comparisons, unless otherwise specified, we resize the input to 800 × 500 during training and inference. We choose Faster R-CNN (ResNet50 [13]) with FPN [21] as the baseline. We use the SGD optimizer with an initial learning rate of 0.02, a momentum of 0.9 and a weight decay of $10^{-4}$. We train the detectors for 12 epochs, with the learning rate decreased by a factor of 10 at epochs 8 and 11. We use only random flipping for data augmentation and set the batch size to 16. Note that Swin Transformer experiments that are not specially highlighted follow the hyper-parameters of [19]. The inference speed of the detectors, in FPS (frames per second), is measured on a machine with one V100 GPU.
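The step schedule described above can be written as a short function. This is a sketch rather than the toolbox's scheduler: the function name `learning_rate` is hypothetical, and it assumes that epochs are counted from 0 and that each drop takes effect from the listed epoch index onwards.

```python
from typing import Sequence


def learning_rate(
    epoch: int,
    base_lr: float = 0.02,
    milestones: Sequence[int] = (8, 11),
    gamma: float = 0.1,
) -> float:
    """Learning rate used during `epoch` (0-indexed) under a step schedule.

    The rate is multiplied by `gamma` once for every milestone that has
    already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

Under these assumptions, the 12-epoch schedule uses 0.02 for epochs 0 to 7, roughly 0.002 for epochs 8 to 10 and roughly 0.0002 for the final epoch.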

For comparison with state-of-the-art detectors, we use multi-scale training [49] (the short side is resized to 400∼1400 and the long side is at most 1600) and a longer training schedule (details can be found in Section 4.2). During inference, we use Soft-NMS [50] with a threshold of 0.001, and the input size is set to 1600 × 1400. Unless otherwise specified, all other hyper-parameters in this article follow MMDetection.
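One way to read the multi-scale training setting is as a random choice of target short side followed by an aspect-preserving resize that is capped by the long-side limit. The sketch below follows that reading; the function names are hypothetical, and keeping the aspect ratio, sampling the short side uniformly and rounding to whole pixels are assumptions of this example rather than details given in the text.

```python
import random
from typing import Tuple


def resize_keep_ratio(
    width: int, height: int, short_side: int, max_long: int = 1600
) -> Tuple[int, int]:
    """Scale (width, height) so the short side reaches `short_side`,
    unless that would push the long side beyond `max_long`."""
    scale = min(short_side / min(width, height), max_long / max(width, height))
    return round(width * scale), round(height * scale)


def sample_training_size(
    width: int,
    height: int,
    rng: random.Random,
    short_range: Tuple[int, int] = (400, 1400),
    max_long: int = 1600,
) -> Tuple[int, int]:
    """Draw a short-side target uniformly from `short_range` and resize."""
    short_side = rng.randint(*short_range)
    return resize_keep_ratio(width, height, short_side, max_long)
```

For a 640 × 480 image and a sampled short side of 1400, the long-side cap is the binding constraint and the result is 1600 × 1200; with a sampled short side of 480 the image keeps its original size.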

4.2 Comparison with state-of-the-art methods

We compare our method with cutting-edge detectors. The results are divided into object detection (Table 1) and instance segmentation (Table 2), according to whether instance segmentation annotations are used during training. Following [19], we add four convolutional layers to each bounding box head [54] and use GIoU loss [55] instead of smooth L1 [56] to improve the detector heads of Cascade R-CNN, Cascade Mask R-CNN and HTC in these two tables.



4.2.1 Object detection

For detectors trained with bounding box annotations only, Table 1 groups them into two categories: anchor-based and anchor-free. We choose ATSS [7] as our anchor-free representative and Cascade R-CNN as our anchor-based representative.

Anchor-free. Our Dual-Res2Net101-DCN equipped with ATSS is trained for 20 epochs, with the learning rate decayed by a factor of 10 at the 16th and 19th epochs. Notably, our Dual-Res2Net101-DCN achieves 52.8% AP, outperforming previous anchor-free methods [7], [25], [26], [27], [30], [36], [44] under the single-scale testing protocol.

Anchor-based. Our Dual-Res2Net101-DCN achieves 55.6% AP, surpassing other anchor-based detectors [22], [23], [31], [32], [33], [35], [46], [57]. Notably, our CBNetV2 is trained for only 32 epochs (the first 20 epochs are regular training and the remaining 12 epochs use stochastic weight averaging [58]), which is 16× and 12× shorter than EfficientDet and YOLOv4, respectively.

4.2.2 Instance segmentation



In Table 2, we further compare our method with state-of-the-art results [19], [51], [52], [53] that use both bounding box and instance segmentation annotations. Following [19], we provide results with backbones pre-trained on regular ImageNet-1K and on ImageNet-22K to show the high capacity of CBNetV2.

Results with regular ImageNet-1K pre-training. Following [19], the 3x schedule (36 epochs, with the learning rate decayed by a factor of 10 at epochs 27 and 33) is used for Dual-Swin-S. With Cascade Mask R-CNN, our Dual-Swin-S achieves 56.3% box AP and 48.6% mask AP on COCO minival, a significant gain of +4.4% box AP and +3.6% mask AP over Swin-B, which has a similar model size and the same training protocol. In addition, Dual-Swin-S achieves 56.9% box AP and 49.1% mask AP on COCO test-dev, outperforming other detectors based on ImageNet-1K pre-trained backbones.

Results with ImageNet-22K pre-training. Our Dual-Swin-B achieves single-scale results of 58.4% box AP and 50.7% mask AP on COCO minival, which is 1.3% box AP and 1.2% mask AP higher than Swin-L (HTC++) [19], while the number of parameters is reduced by 17% and the training schedule is reduced by 3.6×. In particular, with only 12 epochs of training (6× shorter than Swin-L), our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev, outperforming the prior art. With multi-scale testing, we push the best result to a new record of 60.1% box AP and 52.3% mask AP. These results demonstrate that CBNetV2 provides an efficient, effective and resource-friendly framework for building high-performance backbone networks.

4.3 Generality of CBNetV2

CBNetV2 expands the receptive field by compositing backbones in parallel rather than simply increasing the depth of the network. To demonstrate the effectiveness and generality of this design strategy, we conduct experiments on various backbones and on different head designs of the detector architecture.

4.3.1 Generality for mainstream backbone architectures


Effectiveness. To demonstrate the effectiveness of CBNetV2, we conduct experiments on Faster R-CNN with different backbone architectures. As shown in Table 3, for CNN-based backbones (for example, ResNet, ResNeXt-32x4d and Res2Net), our method improves the baselines by more than 3.4% AP. Moreover, CBNetV2 is compatible not only with CNN-based backbones but also with Transformer-based backbones (see Section 4.3.2).



Efficiency. Note that the number of parameters in CBNetV2 increases compared with the baseline. To better demonstrate the efficiency of the composite architecture, we compare CBNetV2 with deeper and wider backbone networks. As shown in Table 4, with comparable FLOPs and inference speed, CBNetV2 improves the AP of ResNet101, ResNeXt101-32x4d and Res2Net101 by 1.7%, 2.1% and 1.1%, respectively. In addition, the AP of Dual-ResNeXt50-32x4d is 1.1% higher than that of ResNeXt101-64x4d with only 70% of the parameters. These results show that our composite backbone architecture is more effective than simply increasing the depth and width of the network.

4.3.2 Generality for Swin Transformer

Transformers are known for using attention to model long-range dependencies in data, and Swin Transformer [19] is one of the most representative recent works. Specifically, Swin Transformer is a general-purpose Transformer backbone that constructs hierarchical feature maps and has computational complexity linear in image size. We conduct experiments on Swin Transformer to show the model generality of CBNetV2. For a fair comparison, we follow the same training strategy as [19]: multi-scale training (the short side is resized to 480∼800 and the long side is at most 1333), the AdamW optimizer (initial learning rate of 0.0001, weight decay of 0.05, batch size of 16) and the 3x schedule (36 epochs).



As shown in Table 5, the accuracy of the model increases slowly as Swin Transformer is deepened and widened, and saturates at Swin-S. Swin-B is only 0.1% AP higher than Swin-S, but has 38M more parameters. With Dual-Swin-T, we improve on Swin-T by 3.1% box AP and 2.5% mask AP, reaching 53.6% box AP and 46.2% mask AP. Surprisingly, our Dual-Swin-T is 1.7% box AP and 1.2% mask AP higher than the deeper and wider Swin-B, with lower model complexity (for example, FLOPs 836G vs. 975G and parameters 113.8M vs. 145.0M). These results show that CBNetV2 can also improve architectures that are not purely convolutional. They also show that CBNetV2 pushes the accuracy ceiling of high-performance detectors more effectively than simply increasing the depth and width of the network.

4.3.4 Compatibility of CBNetV2 with deformable convolution



Deformable convolution [59] enhances the transformation modeling capability of CNNs and is widely used in accurate object detectors (for example, simply adding DCN improves Faster R-CNN ResNet50 from 34.6% to 37.4% AP). To show the compatibility of the CBNetV2 architecture with deformable convolution, we conduct experiments on ResNet and ResNeXt equipped with Faster R-CNN. As shown in Table 7, DCN is still effective on Dual-Backbone, with an improvement of 2.3%∼2.7% AP. This improvement is larger than the increments of 2.0% AP and 1.3% AP on ResNet152 and ResNeXt101-64x4d. On the other hand, Dual-Backbone improves ResNet50-DCN by 3.0% AP, which is 0.6% higher than the deeper ResNet152-DCN. In addition, Dual-Backbone improves ResNeXt50-32x4d-DCN by 3.7% AP, which is 1.3% higher than the deeper and wider ResNeXt101-64x4d-DCN. These results show that the gains from CBNetV2 and deformable convolution stack without conflict.



4.4 Ablation studies

We ablate the design choices of the proposed CBNetV2. For simplicity, unless otherwise specified, all accuracy results here are on the COCO validation set with an input size of 800 × 500.

4.4.1 Effectiveness of different composite styles

We conduct experiments to compare the composite styles proposed in Figure 2, including SLC, AHLC, ALLC, DHLC and FCC. All of these experiments are based on the Faster R-CNN Dual-ResNet50 architecture. The results are shown in Table 8.

SLC performs only slightly better than the single-backbone baseline. We speculate that this is because the features extracted at the same stage of the two backbones are similar, so SLC learns only slightly more semantic information than a single backbone.

AHLC improves the baseline by 1.4% AP, which confirms our motivation in Section 3.2.2: if the higher-level features of the assisting backbone are fed to the lower levels of the lead backbone, the semantic information of the latter is enhanced.

DHLC improves the baseline significantly, by 2.7% AP (from 34.6% AP to 37.3% AP). More composite connections in the high-to-low-level case enrich the representation ability of the features to some extent.

FCC achieves the best performance, 37.4% AP, with its fully connected architecture.

In summary, FCC and DHLC achieve the two best results. Considering computational simplicity, we recommend DHLC for CBNetV2. All of the above composite styles have almost the same number of parameters, yet their accuracy varies greatly. These results show that simply increasing the number of parameters or adding another backbone network does not guarantee better results, so the composite connection is key to a composite backbone. They also show that the suggested DHLC composite style is effective and non-trivial.

4.4.2 Different weights of auxiliary supervision

The experimental results on weighting the auxiliary supervision are shown in Table 9. For simplicity, we use the DHLC composite style on CBNetV2. The first setting is the Faster R-CNN Dual-ResNet50 baseline and the second is the Triple-ResNet50 baseline, where the λ of the assisting backbones in Equation (8) is set to zero. For the Dual-Backbone (DB) structure, setting λ1 to 0.5 improves the baseline by 0.8% AP. For the Triple-Backbone (TB) structure, setting {λ1, λ2} to {0.5, 1.0} improves the baseline by 1.8% AP. These results verify that auxiliary supervision forms an effective training strategy that improves the performance of CBNetV2.

4.4.3 Effectiveness of each component

To further analyze the importance of each component of CBNetV2, the composite backbone, the DHLC composite style, and auxiliary supervision are applied to the model step by step to verify their effectiveness. We choose AHLC from CBNet [1] as the default composite style.



The results are summarized in Table 10. Dual-Backbone (DB) and Triple-Backbone (TB) improve the baseline by 1.4% and 1.8% AP, respectively, which validates the effectiveness of our composite backbone structure (CBNet [1]). The DHLC composite style further improves the detection performance of DB and TB by more than 1.0% AP. This shows that DHLC achieves a larger receptive field, with the features at each level obtaining rich semantic information from all higher-level features. Auxiliary supervision brings DB and TB a gain of 1.0% AP: supervising the assisting backbones yields a better training strategy and improves the representational ability of the lead backbone. Note that auxiliary supervision introduces no additional parameters at inference time. When the three components are combined, the improvement over the baseline is significant. DB and TB with the DHLC style and auxiliary supervision reach 37.9% AP and 38.9% AP, gains of +3.3% and +4.3% AP, respectively. In addition, CBNetV2 improves on CBNet [1] by 1.8% and 1.7% AP for DB and TB, respectively. In short, each component of CBNetV2 improves the detector, and the components complement each other.

4.4.4 Efficiency of pruning strategy



As shown in Figure 6, with the pruning strategy, our Dual-ResNet50 family and Triple-ResNet50 family achieve better FLOPs-accuracy trade-offs than the ResNet family, which demonstrates the efficiency of our pruning strategy. In particular, s3 has 10% fewer FLOPs than s4, while its accuracy drops by only 0.1%. This is because the weights of the pruned stage are fixed during detector training [48], so pruning that stage does not sacrifice detection accuracy. Therefore, when speed and memory cost need to be prioritized, we suggest pruning the fixed stages in the 2nd, 3rd, ..., K-th backbones of CBNetV2.

4.4.5 Effectiveness of different numbers of backbones



To further explore the ability of CBNetV2 to build high-performance detectors, we evaluate the efficiency of CBNetV2 (s3 version) by controlling the number of backbones. As shown in Figure 7, we vary the number of backbones (e.g., K = 1, 2, 3, 4, 5) and compare their accuracy and computational cost (GFLOPs) with the ResNet family. Note that accuracy keeps increasing as model complexity increases. Compared with ResNet152, our method achieves higher accuracy at K = 2 with a lower computational cost. For K = 3, 4, 5, accuracy improves further. CBNetV2 thus provides an effective and efficient alternative for improving model performance, rather than simply increasing the depth or width of the network.

4.5 Class activation map

To better understand the representational ability of CBNetV2, we use Grad-CAM [60] to visualize class activation maps (CAMs), which are commonly used to locate discriminative regions in image classification and object detection. As shown in Figure 8, stronger CAM regions are covered by lighter/warmer colors. To better illustrate the multi-scale detection ability of CBNetV2, we visualize the large-scale feature maps of stage 2 (used to detect small objects) and the small-scale feature maps of stage 5 (used to detect large objects) extracted by our Dual-ResNet50 and by ResNet50. Compared with ResNet, the CAM results based on Dual-ResNet show more concentrated activation on large objects with the stage 5 features, such as "person" and "dog" in Figure 8, whereas ResNet covers the objects only partially or is distracted by the background. On the other hand, Dual-ResNet discriminates small objects more strongly with the stage 2 features, such as "kite" in Figure 8(a), "skateboard" in (b), "surfboard" in (c), and "tennis racket" in (d, e), whereas ResNet shows little activation on these parts.
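For readers unfamiliar with Grad-CAM, the following pure-Python sketch shows its core computation. Each channel of a feature map is weighted by the spatial mean of the gradient of the target score with respect to that channel, and the weighted sum is passed through a ReLU. The function name and the toy inputs are illustrative assumptions. In practice the activations and gradients come from a forward and backward pass through the detector, which is not shown here, and this is not the authors' visualization code.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from one layer's activations and gradients.

    Both arguments are nested lists shaped [channels][height][width].
    """
    num_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = spatial average of the gradient for that channel.
    weights = [
        sum(sum(row) for row in gradients[c]) / (height * width)
        for c in range(num_channels)
    ]

    # Weighted sum over channels at each position, followed by ReLU.
    return [
        [
            max(0.0, sum(weights[c] * activations[c][y][x]
                         for c in range(num_channels)))
            for x in range(width)
        ]
        for y in range(height)
    ]


if __name__ == "__main__":
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
    print(grad_cam(acts, grads))
```

The resulting map has the spatial size of the chosen feature map (large for stage 2, small for stage 5) and is typically upsampled and overlaid on the input image as a heat map.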



5 Conclusion

In this paper, we propose a novel and flexible backbone framework, called Composite Backbone Network V2 (CBNetV2), to improve the performance of cutting-edge object detectors. CBNetV2 consists of a series of parallel backbones with the same network architecture, the Dense Higher-Level Composition style, and auxiliary supervision. Together they form a backbone network with strong representational ability that reuses existing pre-trained backbones under the pre-training fine-tuning paradigm, which also provides a superior approach to object detection. CBNetV2 generalizes well to different backbone and head designs of detector architectures. Extensive experimental results show that the proposed CBNetV2 is compatible with various backbone networks, including CNN-based (ResNet, ResNeXt, Res2Net) and Transformer-based (Swin Transformer) ones. At the same time, CBNetV2 is more effective and efficient than simply increasing the depth and width of the network. In addition, CBNetV2 can be flexibly plugged into most mainstream detectors, including one-stage (e.g., RetinaNet) and two-stage (Faster R-CNN, Mask R-CNN, Cascade R-CNN, and Cascade Mask R-CNN) detectors, as well as anchor-based (e.g., Faster R-CNN) and anchor-free (ATSS) ones. Specifically, it improves the performance of these detectors by more than 3% AP. In particular, our Dual-Swin-L achieves a new record of 59.4% box AP and 51.6% mask AP on COCO test-dev, outperforming previous single-model, single-scale results. With multi-scale testing, we achieve new state-of-the-art results of 60.1% box AP and 52.3% mask AP without additional training data.
