
# 0 Introduction

## 0.1 Topic

## 0.2 Background

Weakly supervised video anomaly detection (*VAD*) based on multiple instance learning (*MIL*) usually relies on the assumption that abnormal segments receive higher anomaly scores than normal segments. **At the beginning of training, the model is not yet accurate, so it easily selects the wrong segments as abnormal.**

## 0.3 Method

1) To reduce the probability of wrong selection, a multi-sequence learning (*Multi-Sequence Learning, MSL*) method and an MSL-based ranking loss are proposed, which use a sequence of multiple consecutive segments as the optimization unit;

2) A Transformer-based MSL network is designed to learn both the video-level anomaly probability and the segment-level anomaly scores;

3) In the inference stage, the video-level anomaly probability is used to suppress fluctuations of the segment-level anomaly scores;

4) Since VAD needs to predict segment-level anomaly scores, a self-training strategy is proposed to gradually refine the anomaly scores by progressively reducing the length of the selected sequence.

## 0.4 BibTeX

```
@inproceedings{Li:2022:self,
  author    = {Shuo Li and Fang Liu and Licheng Jiao},
  title     = {Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection},
  booktitle = {{AAAI} Conference on Artificial Intelligence},
  year      = {2022}
}
```

# 1 Algorithm

## 1.1 Notation and problem statement

In weakly supervised VAD, annotations are given only at the video level: a video is labeled 1 (positive) if it contains an anomaly, and 0 (negative) otherwise. Given a video $V=\{v_{i}\}_{i=1}^{T}$ containing $T$ segments, its video-level label is $Y\in\{0,1\}$. MIL-based methods treat $V$ as a bag and each $v_{i}$ as an instance. A positive video is therefore regarded as a positive bag $B_{a}=(a_{1},a_{2},\dots,a_{T})$, and a negative video as a negative bag $B_{n}=(n_{1},n_{2},\dots,n_{T})$.

The **goal** of VAD is to learn a function $f_{\theta}$ that maps each segment to the interval $[0,1]$. MIL-based VAD **assumes** that abnormal segments receive higher anomaly scores than normal segments. Sultani et al. cast VAD as an anomaly-scoring problem, with a ranking objective and a MIL ranking loss:

$$\max_{i\in B_{a}} f_{\theta}(a_{i}) > \max_{i\in B_{n}} f_{\theta}(n_{i}), \qquad (1)$$

$$L(B_{a},B_{n})=\max\Big(0,\ \max_{i\in B_{n}} f_{\theta}(n_{i})-\max_{i\in B_{a}} f_{\theta}(a_{i})\Big). \qquad (2)$$

To make the gap between positive and negative instances as large as possible, Sultani et al. use a **hinge loss**:

$$L(B_{a},B_{n})=\max\Big(0,\ 1-\max_{i\in B_{a}} f_{\theta}(a_{i})+\max_{i\in B_{n}} f_{\theta}(n_{i})\Big). \qquad (3)$$

At the beginning of optimization, $f_{\theta}$ has little ability to predict anomalies, so it may select a normal instance as the abnormal instance. Once such an error occurs, it propagates through the whole training process. Moreover, an anomaly usually spans multiple consecutive segments, but MIL-based methods do not exploit this prior.
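To make the hinge ranking loss concrete, here is a minimal pure-Python sketch; `mil_hinge_loss` and its arguments are illustrative names, not from the paper's code.

```python
# Minimal sketch of the MIL hinge ranking loss (Equation 3).
# `mil_hinge_loss` is an illustrative name, not the paper's implementation.
def mil_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Rank the top-scoring instance of the positive bag (anomalous video)
    above the top-scoring instance of the negative bag (normal video)."""
    return max(0.0, margin - max(pos_scores) + max(neg_scores))

# If the model wrongly gives a normal segment a high score, the loss grows,
# pushing the optimizer to widen the gap between the two bags.
print(mil_hinge_loss([0.1, 0.5, 0.2], [0.1, 0.2, 0.1]))
```

Note that the loss only sees the single top-scoring instance of each bag, which is exactly the weakness MSL addresses.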

## 1.2 MSL

To alleviate the above shortcomings of MIL, a novel MSL method is proposed. As shown in Figure 2, given a video $V=\{v_{i}\}_{i=1}^{T}$ containing $T$ segments, the anomaly-score curve is first predicted by the mapping function $f_{\theta}$. Suppose the 5th segment has the largest anomaly score $f_{\theta}(v_{5})$. A MIL-based method would select the 5th segment to optimize the network. The proposed MSL instead uses a **sequence selection method**, which selects a sequence of $K$ consecutive segments. Specifically, the average anomaly score of every possible sequence of $K$ consecutive segments is computed:

$$S=\{s_{i}\}_{i=1}^{T-K+1},\qquad s_{i}=\frac{1}{K}\sum_{k=0}^{K-1} f_{\theta}(v_{i+k}), \qquad (4)$$

where $s_{i}$ denotes the average anomaly score of the sequence starting at the $i$-th segment. The sequence with the largest average score, $\max_{s_{i}\in S} s_{i}$, is then selected.
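The sequence selection step can be sketched as a sliding-window mean followed by an argmax; `select_sequence` is an illustrative name, not the paper's code.

```python
# Sketch of the sequence selection step (Equation 4): average the anomaly
# scores over every window of K consecutive segments and pick the window
# with the largest mean.
def select_sequence(scores, K):
    """Return (start index, mean score) of the best K-segment window."""
    means = [sum(scores[i:i + K]) / K for i in range(len(scores) - K + 1)]
    best = max(range(len(means)), key=means.__getitem__)
    return best, means[best]

# A single spike at one segment no longer wins on its own: with K = 2 the
# selected window must contain two consecutive high-scoring segments.
print(select_sequence([0.1, 0.2, 0.9, 0.8, 0.1, 0.1], 2))
```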

Figure 2: Comparison of instance selection in MIL and MSL: a) the anomaly-score curve of a video with $T$ segments, where the 5th segment is assumed to have the largest score $f_{\theta}(v_{5})$; b) a MIL-based method selects the 5th segment; c) MSL selects the sequence of $K$ consecutive segments starting at the $i$-th segment.

Based on the above sequence selection method, the MSL ranking objective can be obtained:

$$\max_{s_{a,i}\in S_{a}} s_{a,i} > \max_{s_{n,i}\in S_{n}} s_{n,i},\qquad s_{a,i}=\frac{1}{K}\sum_{k=0}^{K-1} f_{\theta}(a_{i+k}),\qquad s_{n,i}=\frac{1}{K}\sum_{k=0}^{K-1} f_{\theta}(n_{i+k}), \qquad (5)$$

where $s_{a,i}$ and $s_{n,i}$ denote the sequence scores of the anomalous video and the normal video, respectively. To ensure a large margin between positive and negative instances, similar to Equation 3, the **MSL hinge ranking loss** is defined as:

$$L(B_{a},B_{n})=\max\Big(0,\ 1-\max_{s_{a,i}\in S_{a}} s_{a,i}+\max_{s_{n,i}\in S_{n}} s_{n,i}\Big). \qquad (6)$$

MIL can be seen as a special case of MSL: when $K=1$ the two are equivalent, and when $K=T$, MSL treats every segment of an anomalous video as abnormal.
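Combining the window averaging with the hinge loss gives a compact sketch of Equations 5 and 6; function names are illustrative, not from the paper's code.

```python
# Sketch of the MSL hinge ranking loss (Equations 5-6): sequence scores are
# K-window averages, and the best anomalous window is ranked above the best
# normal window.
def window_means(scores, K):
    return [sum(scores[i:i + K]) / K for i in range(len(scores) - K + 1)]

def msl_hinge_loss(pos_scores, neg_scores, K, margin=1.0):
    return max(0.0, margin
               - max(window_means(pos_scores, K))
               + max(window_means(neg_scores, K)))

# With K = 1 this reduces exactly to the MIL hinge loss over single segments.
print(msl_hinge_loss([0.2, 0.8, 0.8, 0.2], [0.1, 0.1, 0.3, 0.1], K=2))
```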

## 1.3 Transformer-based MSL network

### 1.3.1 Convolutional transformer encoder

The Transformer takes sequence data as input to model long-range dependencies and has achieved remarkable results in many fields, and the relationships between video segments are exactly such dependencies. However, the Transformer is not good at learning local representations of adjacent segments. Motivated by this, as shown in Figure 1(c), the linear projection in the original Transformer is replaced with a depthwise 1D convolution (DW Conv1D) projection. The new Transformer is named the **convolutional transformer encoder** (*CTE*).
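As a rough illustration of the depthwise-convolution idea behind the DW Conv1D projection, here is a pure-Python sketch under simplifying assumptions (zero "same" padding, no bias); it is not the paper's implementation.

```python
# Pure-Python sketch of a depthwise 1D convolution: each channel has its own
# kernel and channels are never mixed, unlike a full (or linear) projection.
# Zero "same" padding and no bias are simplifying assumptions.
def dw_conv1d(x, kernels):
    """x: list of channels, each a list of T values; kernels: one odd-length
    kernel per channel. Returns the per-channel convolution outputs."""
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + channel + [0.0] * pad
        out.append([sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
                    for t in range(len(channel))])
    return out

# A [0, 1, 0] kernel is the identity; a smoothing kernel such as
# [1/3, 1/3, 1/3] mixes each segment with its neighbors, which is the local
# modeling the plain linear projection lacks.
print(dw_conv1d([[1.0, 2.0, 3.0]], [[0.0, 1.0, 0.0]]))
```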

Figure 1: Overall framework. a) The MSL framework, consisting of a backbone and the MSL Transformer network (MSLNet). Extracted features $F\in\mathbb{R}^{T\times D}$ are fed into MSLNet to obtain anomaly scores, where $T$ and $D$ denote the number of segments and the feature dimension of a single segment, respectively. MSLNet contains a video classifier, which predicts the video-level anomaly probability $p$, and a segment regressor, which predicts the anomaly score $f_{\theta}(v_{i})$ of each segment. BCE denotes the binary cross-entropy loss. b) The self-training MSL pipeline, in which $K$ is gradually reduced from $T$ to 1 by the self-training mechanism. Based on the sequence selection method, the MSL optimization is divided into two steps: predicting pseudo labels and using them to select sequences. c) Design of the convolutional transformer encoder (CTE).

### 1.3.2 MSL Transformer network

As shown in Figure 1(a), the architecture consists of a backbone and MSLNet. Any action recognition model can serve as the backbone, such as C3D, I3D, or VideoSwin. The backbone uses weights pre-trained on action recognition datasets, and each video yields a feature $F\in\mathbb{R}^{T\times D}$. **MSLNet contains a video classifier and a segment regressor.** The video classifier consists of two CTE layers and a linear head that predicts whether the video contains an anomaly:

$$p=\sigma(W_{c}\cdot E_{c}[0]),\qquad E_{c}=\mathrm{CTE}_{\times 2}([\textit{class token} \,\|\, F]), \qquad (7)$$

where $W_{c}$ is the parameter of the linear head, $p$ is the predicted video-level anomaly probability, and the *class token* aggregates the features output by the CTE layers for this prediction. Since VAD is a binary classification problem, the sigmoid function $\sigma$ is used.

The segment regressor predicts the anomaly score of each segment:

$$f_{\theta}(v_{i})=\sigma(W_{r}\cdot E_{r}[i]),\qquad E_{r}=\mathrm{CTE}_{\times 2}(E_{c}), \qquad (8)$$

where $W_{r}$ is the parameter of the linear head and $E_{r}[i]$ is the feature of the $i$-th segment. Since predicting segment anomaly scores is a regression problem in $[0,1]$, $\sigma$ is used here as well.

Video classification and segment regression can be regarded as a multi-task problem, so the **overall optimization objective** is:

$$L=L(B_{a},B_{n})+\mathrm{BCE}(p,Y). \qquad (9)$$

To **reduce fluctuations in the anomaly scores predicted by the segment regressor**, an **anomaly-score correction mechanism** is proposed for the inference stage:

$$\hat{f}_{\theta}(v_{i})=f_{\theta}(v_{i})\times p. \qquad (10)$$
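The correction in Equation 10 is a simple elementwise scaling; a one-line sketch (with an illustrative function name):

```python
# Sketch of the inference-stage anomaly-score correction (Equation 10):
# segment scores are scaled by the video-level anomaly probability p, so a
# video the classifier judges normal (small p) has its spurious segment-score
# spikes suppressed.
def correct_scores(segment_scores, video_prob):
    return [s * video_prob for s in segment_scores]

# A fluctuating curve in a video that is almost surely normal (p = 0.1):
print(correct_scores([0.2, 0.7, 0.3], 0.1))
```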

## 1.4 Self-training MSL

As shown in Figure 1(b), a self-training mechanism is used to refine the training process. Training MSLNet involves two stages, preceded by an initialization step: the pseudo labels $\hat{Y}$ of the training videos are first derived from the ground-truth video labels $Y$, i.e., each segment's pseudo label is initialized to the label of its video.

At the beginning of training, $f_{\theta}$ is not yet able to produce reliable anomaly scores and will most likely select wrong sequences. The two stages of MSL are therefore:

1) Stage 1 (transitional stage): the predicted anomaly scores $f_{\theta}(v_{i})$ in Equation 4 are replaced by the segment pseudo labels $\hat{y}_{i}$, so the sequence with the largest average pseudo label is selected. Based on the $s_{a,i}$ and $s_{n,i}$ of this sequence, MSLNet is optimized with the hinge ranking loss:

$$L(B_{a},B_{n})=\max(0,\ 1-s_{a,i}+s_{n,i}). \qquad (11)$$

After $E_{1}$ training epochs, MSLNet has a preliminary ability to predict anomaly scores.

2) Stage 2: this stage is optimized with Equations 5 and 6. After $E_{2}$ training epochs, new segment-level pseudo labels $\hat{Y}$ are obtained. By halving the sequence length and repeating these two stages, the predicted scores are gradually refined. The pseudo code of self-training MSL is given in Algorithm 1.
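The two-stage loop above can be sketched schematically; the `train_*` and refresh callbacks are hypothetical placeholders for the real training code, and only the control flow is illustrated.

```python
# Schematic of the self-training loop: for each sequence length K (the paper
# reports K in {32, 16, 8, 4, 1} for T = 32), stage 1 selects sequences by
# pseudo labels (Equation 11) and stage 2 by predicted scores (Equations 5-6),
# after which the segment-level pseudo labels are refreshed.
def self_training(K_schedule, train_stage1, train_stage2, refresh_pseudo_labels):
    for K in K_schedule:
        train_stage1(K)          # E1 epochs, sequences chosen by pseudo labels
        train_stage2(K)          # E2 epochs, sequences chosen by f_theta
        refresh_pseudo_labels()  # new pseudo labels for the next, shorter K

# Trace the order of operations with stub callbacks:
log = []
self_training([4, 2, 1],
              lambda K: log.append(("stage1", K)),
              lambda K: log.append(("stage2", K)),
              lambda: log.append("refresh"))
print(log)
```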

# 2 Experiments

## 2.1 Datasets and evaluation metrics

1) **ShanghaiTech** is a medium-scale video dataset of 437 campus surveillance videos covering 130 abnormal events in 13 scenes. However, all of its original training videos are normal. For the weakly supervised setting, the split of 238 training videos and 199 test videos is used.

2) **UCF-Crime** is a large-scale dataset of 1900 untrimmed real-world street and indoor surveillance videos covering 13 classes of abnormal events, with a total duration of 128 hours. The training set contains 1610 videos with video-level labels, and the test set contains 290 videos with frame-level labels.

3) **XD-Violence** is a large-scale dataset of 4754 untrimmed videos with a total duration of 217 hours, collected from multiple sources such as movies, sports, surveillance, and CCTV. The training set contains 3954 videos with video-level labels, and the test set contains 800 videos with frame-level labels.

The evaluation metric for the first two datasets is the frame-level AUC of the ROC curve; XD-Violence uses average precision (*AP*).

## 2.2 Implementation details

1) 4096-D features are extracted from the fc6 layer of C3D pre-trained on Sports-1M;

2) 1024-D features are extracted from the Mixed 5c layer of pre-trained I3D;

3) 1024-D features are extracted from the Stage 4 layer of VideoSwin pre-trained on Kinetics-400;

4) $T=32$, $K\in\{32,16,8,4,1\}$, $D=16$;

5) The optimizer is SGD with learning rate 0.001, weight decay 0.0005, and batch size 64;

6) $E_{1}=100$, $E_{2}=400$;

7) Each mini-batch consists of 32 randomly selected normal videos and 32 randomly selected abnormal videos; in abnormal videos, the top 10% of segments are selected as abnormal segments;

8) In CTE, the number of heads is set to 12, and a DW Conv1D with kernel size 3 is used.
