# Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection

Inge, 2022-06-23 18:04:08

# 0 Introduction

## 0.1 Title

AAAI 2022: Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection

## 0.2 Background

Weakly supervised video anomaly detection (VAD) with multi-instance learning (MIL) usually relies on the assumption that abnormal segments receive higher anomaly scores than normal segments. Early in training, the model is still inaccurate, so it easily selects the wrong segment as abnormal.

## 0.3 Method

1) To reduce the probability of a wrong selection, a multi-sequence learning (MSL) method and an MSL-based ranking loss are proposed, which use a sequence of multiple segments as the optimization unit;
2) A Transformer-based MSL network is designed to learn both the video-level anomaly probability and the segment-level anomaly scores;
3) At the inference stage, the video-level anomaly probability is used to suppress fluctuations in the segment-level anomaly scores;
4) Since VAD must predict segment-level anomaly scores, a self-training strategy is proposed that refines the anomaly scores step by step by gradually reducing the length of the selected sequence.

## 0.4 Bib

@inproceedings{Li:2022:self,
  author    = {Shuo Li and Fang Liu and Licheng Jiao},
  title     = {Self-training multi-sequence learning with {Transformer} for weakly supervised video anomaly detection},
  booktitle = {{AAAI} Conference on Artificial Intelligence},
  year      = {2022}
}


# 1 Algorithm

## 1.1 Notation and problem statement

In weakly supervised VAD, annotations are given only at the video level: a video that contains an anomaly is labeled 1 (positive), otherwise 0 (negative). Given a video $V=\{v_i\}_{i=1}^T$ with $T$ segments, its video-level label is $Y\in\{0,1\}$. MIL-based methods treat $V$ as a bag and each $v_i$ as an instance. A positive video is therefore a positive bag $\mathcal{B}_a=(a_1,a_2,\dots,a_T)$, and a negative video is a negative bag $\mathcal{B}_n=(n_1,n_2,\dots,n_T)$.
The goal of VAD is to learn a function $f_\theta$ that maps segments to the interval $[0,1]$. MIL-based VAD assumes that abnormal segments score higher than normal segments. Sultani et al. treat VAD as an anomaly scoring problem and propose a ranking objective and an MIL ranking loss:
$$
\max_{i\in\mathcal{B}_a}f_\theta(a_i)>\max_{i\in\mathcal{B}_n}f_\theta(n_i), \tag{1}
$$
$$
\mathcal{L}(\mathcal{B}_a,\mathcal{B}_n)=\max\Big(0,\max_{i\in\mathcal{B}_n}f_\theta(n_i)-\max_{i\in\mathcal{B}_a}f_\theta(a_i)\Big). \tag{2}
$$
To make the gap between positive and negative instances as large as possible, Sultani et al. use a hinge loss:
$$
\mathcal{L}(\mathcal{B}_a,\mathcal{B}_n)=\max\Big(0,1-\max_{i\in\mathcal{B}_a}f_\theta(a_i)+\max_{i\in\mathcal{B}_n}f_\theta(n_i)\Big). \tag{3}
$$
At the start of optimization, $f_\theta$ already needs some ability to predict anomalies; otherwise it will select a normal instance as the abnormal one, and this error propagates through the whole training process. In addition, an anomaly usually spans several consecutive segments, but MIL-based methods do not exploit this prior.
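The hinge ranking loss of Eq. 3 can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation; the function name `mil_hinge_loss`, the `margin` parameter, and the toy scores are assumptions made for the example.

```python
def mil_hinge_loss(abnormal_scores, normal_scores, margin=1.0):
    """Hinge ranking loss of Eq. 3 on one positive and one negative bag."""
    top_abnormal = max(abnormal_scores)  # highest-scoring instance in B_a
    top_normal = max(normal_scores)      # highest-scoring instance in B_n
    return max(0.0, margin - top_abnormal + top_normal)


# Toy scores in [0, 1] (hypothetical values, not taken from the paper).
well_separated = mil_hinge_loss([0.1, 1.0, 0.3], [0.0, 0.0, 0.0])
badly_ranked = mil_hinge_loss([0.2, 0.3, 0.1], [0.1, 0.6, 0.2])
```

Only the single top-scoring instance of each bag contributes to the loss, which is why one wrongly selected normal instance in an abnormal video can mislead training.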

## 1.2 MSL

To alleviate these shortcomings of MIL-based methods, we propose a novel MSL method. As shown in Figure 2, given a video $V=\{v_i\}_{i=1}^T$ with $T$ segments, the mapping function $f_\theta$ first predicts an anomaly score curve. Suppose the 5th segment has the largest anomaly score $f_\theta(v_5)$. A MIL-based method would select the 5th segment to optimize the network. MSL instead uses a sequence selection method that selects a sequence of $K$ consecutive segments. Specifically, we compute the mean anomaly score of every possible sequence of $K$ consecutive segments:
$$
S=\{s_i\}_{i=1}^{T-K+1},\qquad s_i=\frac{1}{K}\sum_{k=0}^{K-1}f_\theta(v_{i+k}), \tag{4}
$$
where $s_i$ is the mean anomaly score of the sequence that starts at the $i$-th segment. The sequence with the largest mean anomaly score is then selected, i.e. $\max_{s_i\in S}s_i$.
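The sequence selection of Eq. 4 amounts to a sliding-window mean followed by an argmax. The sketch below is an illustration under stated assumptions: it uses 0-based indices, enumerates every window that fits in the video (that is, $T-K+1$ start positions), and breaks ties by taking the earliest window. The function names and the score curve are hypothetical.

```python
def sequence_means(scores, K):
    """Mean score of every window of K consecutive segments (Eq. 4)."""
    T = len(scores)
    if not 1 <= K <= T:
        raise ValueError("K must satisfy 1 <= K <= T")
    return [sum(scores[i:i + K]) / K for i in range(T - K + 1)]


def select_sequence(scores, K):
    """Return (0-based start index, mean score) of the best window."""
    means = sequence_means(scores, K)
    start = max(range(len(means)), key=means.__getitem__)
    return start, means[start]


# Hypothetical score curve: an isolated spike at index 1
# and a sustained run of high scores at indices 4..6.
scores = [0.1, 0.9, 0.1, 0.1, 0.6, 0.7, 0.6, 0.1]
mil_pick = select_sequence(scores, K=1)  # picks the isolated spike
msl_pick = select_sequence(scores, K=3)  # picks the sustained run
```

With `K=1` the selection falls on the isolated spike, while `K=3` prefers the run of consecutive high scores, which matches the prior that anomalies span several adjacent segments.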

Figure 2: Comparison of instance selection in MIL and MSL: a) the anomaly score curve of a video with $T$ segments, where the 5th segment has the largest anomaly score $f_\theta(v_5)$; b) MIL selects the 5th segment; and c) MSL selects the sequence of $K$ consecutive segments that starts at the $i$-th segment.

Based on this sequence selection method, the MSL ranking objective is:
$$
\begin{gathered}
\max_{s_{a,i}\in S_a}s_{a,i}>\max_{s_{n,i}\in S_n}s_{n,i},\\
s_{a,i}=\frac{1}{K}\sum_{k=0}^{K-1}f_\theta(a_{i+k}),\qquad s_{n,i}=\frac{1}{K}\sum_{k=0}^{K-1}f_\theta(n_{i+k}),
\end{gathered} \tag{5}
$$
where $s_{a,i}$ and $s_{n,i}$ are the mean anomaly scores of the sequences that start at the $i$-th segment of the abnormal and the normal video, respectively. To keep a large margin between positive and negative instances, the MSL hinge ranking loss is defined analogously to Eq. 3:
$$
\mathcal{L}(\mathcal{B}_a,\mathcal{B}_n)=\max\Big(0,1-\max_{s_{a,i}\in S_a}s_{a,i}+\max_{s_{n,i}\in S_n}s_{n,i}\Big). \tag{6}
$$
MIL can be seen as a special case of MSL: the two are equivalent when $K=1$, and when $K=T$, MSL treats every segment of an abnormal video as abnormal.
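To make Eq. 6 and its two limiting cases concrete, here is a small sketch on plain Python lists. It is an illustration rather than the authors' code; `best_window_mean`, `msl_hinge_loss`, the `margin` parameter, and the scores are assumptions made for the example.

```python
def best_window_mean(scores, K):
    """Largest mean over all windows of K consecutive segments."""
    return max(sum(scores[i:i + K]) / K for i in range(len(scores) - K + 1))


def msl_hinge_loss(abnormal_scores, normal_scores, K, margin=1.0):
    """Hinge ranking loss of Eq. 6 between the best abnormal and normal sequences."""
    s_a = best_window_mean(abnormal_scores, K)
    s_n = best_window_mean(normal_scores, K)
    return max(0.0, margin - s_a + s_n)


# Hypothetical predicted scores for one abnormal and one normal video (T = 8).
abnormal = [0.1, 0.9, 0.1, 0.1, 0.6, 0.7, 0.6, 0.1]
normal = [0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1]

loss_k1 = msl_hinge_loss(abnormal, normal, K=1)  # reduces to the MIL loss of Eq. 3
loss_k8 = msl_hinge_loss(abnormal, normal, K=8)  # compares whole-video mean scores
```

With `K=1` each window is a single segment, so the loss depends only on the two per-video maxima. With `K=len(abnormal)` there is a single window per video, so all segments of the abnormal video are pushed up together.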

## 1.3 Transformer-based MSL network

### 1.3.1 Convolutional Transformer encoder

The Transformer takes sequence data as input to model long-range dependencies and has achieved remarkable results in many fields. Representations across video segments are very important for VAD. However, the Transformer is not good at learning local representations of adjacent segments. Motivated by this, as shown in Figure 1(c), we replace the linear projection in the original Transformer with a depthwise 1-D convolution (DW Conv1D) projection. The resulting Transformer is called the convolutional Transformer encoder (CTE).
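The difference between a linear projection and a depthwise 1-D convolution can be seen in a tiny pure-Python sketch. A linear projection transforms each segment on its own, whereas a depthwise convolution lets every feature channel look at its temporal neighbours. The code below only illustrates the depthwise operation itself (computed as cross-correlation with zero padding, as deep learning frameworks usually do). Where exactly it sits inside the encoder, as well as the function name and the toy kernels, are assumptions not taken from the text.

```python
def depthwise_conv1d(features, kernels):
    """Depthwise 1-D convolution over time with zero padding (output length T).

    features: T rows of D channel values; kernels: D kernels of odd length.
    Channel d is filtered only by kernels[d]; channels are never mixed.
    """
    T, D = len(features), len(features[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for d in range(D):
            for j, w in enumerate(kernels[d]):
                src = t + j - half      # neighbouring time step
                if 0 <= src < T:        # zero padding outside the video
                    out[t][d] += w * features[src][d]
    return out


# Hypothetical input: T = 4 segments, D = 2 channels, kernel size 3.
F = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
kernels = [[1 / 3, 1 / 3, 1 / 3],  # channel 0: average with both neighbours
           [0.0, 1.0, 0.0]]        # channel 1: identity
projected = depthwise_conv1d(F, kernels)
```

In PyTorch the same operation is commonly written as `torch.nn.Conv1d(D, D, kernel_size=3, padding=1, groups=D)`, where `groups=D` is what makes the convolution depthwise.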

Figure 1: Overall framework. a) The MSL architecture, which consists of a backbone and the Transformer-based MSL network (MSLNet). The backbone extracts a feature $F\in\mathbb{R}^{T\times D}$, which is fed into MSLNet to obtain anomaly scores, where $T$ and $D$ are the number of segments and the feature dimension of a single segment, respectively. MSLNet contains a video classifier, which predicts the video-level anomaly probability $p$, and a segment regressor, which predicts the anomaly score $f_\theta(v_i)$ of each segment. BCE denotes the binary cross-entropy loss; b) the self-training MSL pipeline, in which $K$ is gradually reduced from $T$ to 1 by the self-training mechanism. Based on the sequence selection method, MSL optimization has two steps: selecting sequences with pseudo-labels and selecting sequences with predictions; c) the design of the convolutional Transformer encoder (CTE).

### 1.3.2 MSL Transformer network

As shown in Figure 1(a), the architecture consists of a backbone and MSLNet. Any action recognition method can serve as the backbone, such as C3D, I3D, or VideoSwin. The backbone uses weights pre-trained on an action recognition dataset and produces a feature $F\in\mathbb{R}^{T\times D}$ for each video.
MSLNet contains a video classifier and a segment regressor. The video classifier consists of two CTE layers and a linear head that predicts whether the video contains an anomaly:
$$
p=\sigma(\mathcal{W}^c\cdot E^c[0]),\qquad E^c=\mathrm{CTE}_{\times2}(\text{class token}\,\|\,F), \tag{7}
$$
where $\mathcal{W}^c$ are the parameters of the linear head, $p$ is the predicted anomaly probability of the video, and the class token aggregates features through the CTE layers to predict that probability. Since VAD is a binary classification problem, the sigmoid function $\sigma$ is used.
The segment regressor predicts the anomaly score of each segment:
$$
f_\theta(v_i)=\sigma(\mathcal{W}^r\cdot E^r[i]),\qquad E^r=\mathrm{CTE}_{\times2}(E^c), \tag{8}
$$
where $\mathcal{W}^r$ are the parameters of the linear head and $E^r[i]$ is the feature of the $i$-th segment. Predicting segment anomaly scores is treated as a regression problem, and $\sigma$ is used here as well.
Video classification and segment regression can be viewed as a multi-task problem, so the overall optimization objective is:
$$
\mathcal{L}=\mathcal{L}(\mathcal{B}_a,\mathcal{B}_n)+\mathrm{BCE}(p,Y). \tag{9}
$$
To reduce fluctuations in the anomaly scores predicted by the segment regressor, we propose a score correction mechanism for the inference stage:
$$
\hat{f}_\theta(v_i)=f_\theta(v_i)\times p. \tag{10}
$$
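The heads of Eqs. 7 and 8, the joint objective of Eq. 9, and the correction of Eq. 10 can be sketched without the encoder. The example below assumes the CTE embeddings are already available as plain lists. All function names and numbers are hypothetical and only illustrate the arithmetic.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear_head(weights, embedding):
    """sigma(W . e): a linear head followed by a sigmoid, as in Eqs. 7 and 8."""
    return sigmoid(sum(w * e for w, e in zip(weights, embedding)))


def binary_cross_entropy(p, y):
    """BCE(p, Y) for one video with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def total_loss(ranking_loss, p, y):
    """Eq. 9: ranking loss plus video-level BCE."""
    return ranking_loss + binary_cross_entropy(p, y)


def corrected_scores(segment_scores, video_prob):
    """Eq. 10: scale every segment score by the video-level probability p."""
    return [s * video_prob for s in segment_scores]


# Hypothetical outputs for a video that the classifier considers normal.
segment_scores = [0.2, 0.7, 0.3]  # f_theta(v_i) with one spurious peak
p = 0.1                           # video-level anomaly probability
suppressed = corrected_scores(segment_scores, p)
```

Because $p$ is shared by all segments of a video, the correction keeps the shape of the score curve but lowers its level when the classifier believes the video is normal, so the spurious peak at 0.7 is scaled down to about 0.07.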

## 1.4 Self-training MSL

As shown in Figure 1(b), a self-training mechanism is used to refine the training process. Training MSLNet has two stages, preceded by an initialization step: pseudo-labels $\hat{\mathcal{Y}}$ are first obtained for the training videos. The initial segment-level pseudo-labels are derived from the ground-truth video labels $\mathcal{Y}$, i.e. each segment takes the label of its video.
At the beginning of training, $f_\theta$ has little ability to predict anomaly scores and is likely to select the wrong sequence. MSL training therefore has two stages:
1) Stage 1, the transitional stage: the predicted anomaly score $f_\theta(v_i)$ in Eq. 4 is replaced by the segment pseudo-label $\hat{y}_i$, so that the sequence with the largest mean pseudo-label is selected. $s_{a,i}$ and $s_{n,i}$ are computed on this sequence, and MSLNet is optimized with the hinge ranking loss:
$$
\mathcal{L}(\mathcal{B}_a,\mathcal{B}_n)=\max(0,1-s_{a,i}+s_{n,i}). \tag{11}
$$
After $E_1$ epochs of training, MSLNet has a preliminary ability to predict anomaly scores.
2) Stage 2: this stage optimizes with Eqs. 5 and 6. After $E_2$ epochs of training, new segment-level pseudo-labels $\hat{\mathcal{Y}}$ are obtained. By halving the sequence length and repeating the two stages, the predicted anomaly scores are gradually refined. Pseudocode for self-training MSL is given in Algorithm 1.
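The two stages differ only in what drives the sequence selection. The sketch below illustrates that difference and the order in which the stages are repeated. It is not the authors' Algorithm 1: the function names, the earliest-window tie-breaking, and the toy values are assumptions, and the step that regenerates pseudo-labels after stage 2 is left out because the text does not specify it.

```python
def best_start(values, K):
    """0-based start of the window of K consecutive values with the largest mean."""
    sums = [sum(values[i:i + K]) for i in range(len(values) - K + 1)]
    return max(range(len(sums)), key=sums.__getitem__)


def window_mean(values, start, K):
    return sum(values[start:start + K]) / K


def stage1_score(pred_scores, pseudo_labels, K):
    """Stage 1: choose the window by pseudo-labels, then average the predictions."""
    return window_mean(pred_scores, best_start(pseudo_labels, K), K)


def stage2_score(pred_scores, K):
    """Stage 2: choose the window by the predicted scores themselves (Eq. 5)."""
    return window_mean(pred_scores, best_start(pred_scores, K), K)


def hinge(s_a, s_n, margin=1.0):
    """Hinge ranking loss on the two selected sequence scores (Eq. 11)."""
    return max(0.0, margin - s_a + s_n)


def training_plan(k_values, e1, e2):
    """List (K, stage, epochs) in the order in which the two stages are repeated."""
    return [(K, stage, epochs) for K in k_values for stage, epochs in ((1, e1), (2, e2))]


# Hypothetical abnormal video: pseudo-labels from an earlier round mark segments
# 2..3, while the current model's highest prediction is (wrongly) at segment 0.
pred = [0.8, 0.1, 0.3, 0.4]
pseudo = [0.0, 0.0, 1.0, 1.0]
s1 = stage1_score(pred, pseudo, K=2)  # window chosen by pseudo-labels
s2 = stage2_score(pred, K=2)          # window chosen by predictions
loss_stage1 = hinge(s1, 0.2)          # 0.2: assumed score of the normal video
plan = training_plan([32, 16, 8, 4, 1], e1=100, e2=400)
```

In this toy case stage 1 optimizes the window marked by the pseudo-labels even though the model currently scores another window higher, which is the reason for running it before the prediction-driven stage 2.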

# 2 Experiments

## 2.1 Datasets and evaluation metrics

1) ShanghaiTech is a medium-sized dataset of 437 campus surveillance videos, covering 130 abnormal events in 13 scenes. However, all of its original training data are normal. For the weakly supervised setting, a split of 238 training videos and 199 test videos is used.
2) UCF-Crime is a large-scale dataset of 1900 untrimmed real-world street and indoor surveillance videos, covering 13 classes of abnormal events, with a total duration of 128 hours. The training set contains 1610 videos with video-level labels, and the test set contains 290 videos with frame-level labels.
3) XD-Violence is a large-scale dataset of 4754 untrimmed videos with a total duration of 217 hours, collected from multiple sources such as movies, sports, surveillance, and CCTV. The training set contains 3954 videos with video-level labels, and the test set contains 800 videos with frame-level labels.
The first two datasets are evaluated with the area under the ROC curve (AUC), and the last with average precision (AP).
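For readers unfamiliar with the two metrics, the sketch below computes them from frame-level labels and scores using only the standard library. It is a didactic version under stated assumptions: AUC is computed by comparing every positive/negative pair (ties count as half), and AP is the mean precision at the ranks of the positive frames. The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative frame")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive frame appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive frame")
    return sum(precisions) / len(precisions)


# Hypothetical frame-level ground truth and predicted anomaly scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

The pairwise AUC is quadratic in the number of frames, so for real test sets a library routine such as `sklearn.metrics.roc_auc_score` or `sklearn.metrics.average_precision_score` is the practical choice; their handling of tied scores may differ slightly from this sketch.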

## 2.2 Implementation details

1) 4096-D features are extracted from the fc6 layer of C3D pre-trained on Sports-1M;
2) 1024-D features are extracted from the Mixed 5c layer of I3D pre-trained on Kinetics-400;
3) 1024-D features are extracted from the Stage4 layer of VideoSwin pre-trained on Kinetics-400;
4) $T=32$, $K=\{32,16,8,4,1\}$, and $D=16$;
5) The optimizer is SGD with a learning rate of 0.001, a weight decay of 0.0005, and a batch size of 64;
6) $E_1=100$ and $E_2=400$;
7) Each mini-batch consists of 32 randomly selected normal and abnormal videos. In each abnormal video, the abnormal segment is chosen at random from the top 10% of segments;
8) In the CTE, the number of heads is set to 12, and DW Conv1D uses a kernel size of 3.
