
# 0 Overview

## 0.1 Subject

## 0.2 Background

The application of **multiple instance learning** (*MIL*) to the classification of **pathology whole slide images** (*histopathology whole slide images, WSIs*) is becoming increasingly mature. However, such research still faces difficulties, such as **small sample cohorts** (*small sample cohorts*): in this setting the number of WSIs (bags) is limited, while a single WSI has an enormous resolution, which in turn yields a huge number of cropped patches (instances).

Tip: I am not sure whether "small sample cohort" is the accurate term. I have downloaded WSIs before; a single sample can exceed 1 GB, which is truly intimidating.

## 0.3 Method

The number of bags is virtually increased by introducing **pseudo-bags** (*pseudo-bags*); on this basis, a **double-tier** (*double-tier*) MIL framework is built to make effective use of their intrinsic features. In addition, the instance probability is derived under the attention-based MIL framework, and this derivation is used to help build and analyze the proposed framework.

## 0.4 Bib

```
@inproceedings{Zhang:2022:double,
  author    = {Hongrun Zhang and Yanda Meng and Yitian Zhao and Yihong Qiao and Xiaoyun Yang and Sarah E. Coupland and Yalin Zheng},
  title     = {{DTFD}-{MIL}: {D}ouble-tier feature distillation multiple instance learning for histopathology whole slide image classification},
  booktitle = {Computer Vision and Pattern Recognition},
  year      = {2022}
}
```

# 1 Introduction

**Whole slide image** (*WSI*) annotation is one of the major challenges in the field of computer vision. WSIs are widely used in histopathology: digital pathology has improved pathologists' workflow and diagnostic decision-making, and it has also stimulated demand for intelligent or automatic WSI analysis tools. A single WSI is enormous, ranging from 100 MB to 10 GB. Because of this unique property, directly applying existing machine learning methods, e.g., models designed for natural or general medical images, is unrealistic; deep learning models require large-scale data and high-quality annotations, but pixel-level labeling of WSIs is prohibitively expensive. Consequently, the problem of **learning with few annotations** has aroused great enthusiasm among deep learning researchers, e.g., weak supervision and semi-supervision, and most weakly supervised WSI research can be characterized as MIL research. Within the MIL framework, a WSI is treated as a bag and can contain thousands of patches (instances). As long as at least one instance is positive, the WSI is positive.

In the field of computer vision, many approaches have tackled the MIL problem. However, the innate nature of WSIs means that WSI classification under MIL is not as simple as in other computer vision subfields, because **the only direct supervision for training is the labels of a few hundred WSIs**. This can lead to overfitting: the model tends to fall into local minima during optimization, the learned features correlate only weakly with the target disease, and the generalization ability of the model is reduced.

To combat overfitting, the guiding idea of MIL-based WSI research is to learn more information from fewer labels. Exploiting the **mutual-instance relation** (*mutual-instance relation*) is one effective approach; it can be specified as spatial or feature distance, or learned through neural modules such as recurrent neural networks, transformers, and graph neural networks.

Most existing methods can be classified as **attention-based** (*attention-based, AB-MIL*); their main difference lies in how the attention scores are computed. However, explicitly inferring instance probabilities has been considered infeasible under the AB-MIL framework, so attention scores are often used instead as indicators of positive activation. In this paper, we argue that the **attention score** is not a rigorous measure for this purpose, and we instead derive the **instance probability** under the AB-MIL framework.

Given an oversized WSI, the **direct processing units** are the vast number of patches cropped from it. A MIL model built for WSIs aims to identify the most discriminative patches, since they are most likely to trigger the bag label. However, **WSIs are few, patches are countless, and the label information is only at the WSI level**. Moreover, in pathology WSIs the positive instances corresponding to lesion regions often occupy only a small portion of the tissue, which makes positive instances even scarcer. Identifying these positive instances under conditions most prone to overfitting therefore remains challenging.

In recent years, although many methods have used **mutual-instance information** to enhance MIL performance, they have not explicitly addressed the problems caused by the intrinsic characteristics of WSIs described above. To alleviate these negative effects, we introduce the concept of a **pseudo-bag** into the framework: the instances of a single WSI are randomly split, and each partition corresponds to a pseudo-bag. Each pseudo-bag is assigned the label of its parent bag, i.e., the original bag. This organically increases the number of bags while ensuring that each pseudo-bag contains only a few instances; this is the core idea of our **double-tier feature distillation MIL model**, as in Figure 1. Specifically, a Tier-1 AB-MIL model is applied to the pseudo-bags of all WSIs. However, there is a **risk**: a pseudo-bag drawn from a positive bag may in fact contain no positive instance and would thus be assigned a wrong label.


To address this problem, we distill a feature vector from each pseudo-bag and build a Tier-2 AB-MIL model on these vectors, as in Figure 3. After such distillation, the Tier-1 model provides discriminative features that allow the Tier-2 model to obtain a better representation of the parent bag. In addition, for feature distillation we draw on the basic idea of **Grad-CAM** (**gradient-based class activation map**, *grad-based class activation map*), a deep feature visualization technique, and **derive the instance probability** within the AB-MIL framework.

Essentially, we approach the WSI problem from a novel perspective, i.e., a double-tier MIL framework. The main **contributions are as follows**:

1) We introduce the concept of pseudo-bags to address the dilemma of insufficient WSIs;

2) Using the basic idea of Grad-CAM, we directly derive the instance probability from the AB-MIL perspective, which may serve as an extension for many future MIL methods;

3) Pushing the derived probability further, we develop a double-tier MIL framework and demonstrate its advantages on two large public WSI datasets.

# 2 Method

## 2.1 Review of Grad-CAM and AB-MIL

### 2.1.1 Grad-CAM

An end-to-end deep learning image classification model usually consists of two modules: a **deep convolutional neural network** (*deep convolution neural network, DCNN*) for high-level feature extraction and a **multi-layer perceptron** (*multi-layer perceptron, MLP*) for classification. Feeding an image to the DCNN yields multiple feature maps, from which a pooling function produces a feature vector; passing this feature vector to the MLP yields the category probabilities, as in Figure 2(a).

Suppose the DCNN outputs the feature maps $U \in \mathbb{R}^{D \times W \times H}$, where $D$ is the number of channels and $W$ and $H$ are the spatial dimensions. Applying global average pooling over $U$ yields the feature vector representing the image:

$$f = \mathrm{GAP}_{W,H}(U) \in \mathbb{R}^{D} \tag{1}$$

where $\mathrm{GAP}_{W,H}(U)$ denotes average pooling over $W$ and $H$, i.e., the $d$-th element of $f$ is $f_{d} = \frac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H} U_{d,w,h}$. Using $f$ as input, the MLP outputs for each category $c \in \{1, 2, \ldots, C\}$ a logit $s_{c}$, which indicates the signal strength that the input belongs to class $c$; the predicted class probabilities are obtained via a softmax operation. The Grad-CAM class activation map for class $c$ is defined as a weighted sum of the feature maps:

$$L^{c} = \sum_{d=1}^{D} \beta_{d} U_{d}, \qquad \beta_{d} = \frac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H} \frac{\partial s_{c}}{\partial U_{d,w,h}} \tag{2}$$

where $L^{c} \in \mathbb{R}^{W \times H}$, and $L^{c}_{w,h}$, the value of $L^{c}$ at position $(w,h)$, indicates the strength with which that position contributes to class $c$:

$$L^{c}_{w,h} = \sum_{d=1}^{D} \beta_{d} U_{d,w,h} \tag{3}$$
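Eqs. 1–3 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the logit gradients $\partial s_{c}/\partial U$ have already been computed and are supplied as an array; `grad_cam` is a hypothetical helper name, not from the paper.

```python
import numpy as np

def grad_cam(U, dS_dU):
    """Class activation map from feature maps and logit gradients (Eqs. 2-3).

    U     : (D, W, H) feature maps from the DCNN.
    dS_dU : (D, W, H) gradient of the class-c logit s_c w.r.t. U.
    Returns L_c : (W, H) activation map for class c.
    """
    beta = dS_dU.mean(axis=(1, 2))         # Eq. 2: spatially averaged gradients
    L_c = np.einsum('d,dwh->wh', beta, U)  # Eq. 3: weighted sum over channels
    return L_c
```

For a linear classifier $s_{c} = w \cdot \mathrm{GAP}(U)$, the gradient is the constant $w_{d}/(WH)$, which makes the result easy to check by hand.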

### 2.1.2 AB-MIL

Given a bag of $K$ instances $X = \{x_{1}, x_{2}, \ldots, x_{K}\}$, each instance $x_{k}$, $k \in \{1, 2, \ldots, K\}$, holds a hidden label $y_{k}$ (unknowable), where $y_{k} = 1$ denotes positive and $y_{k} = 0$ negative. The goal of MIL is to detect whether the bag contains at least one positive instance. The only supervision available during the training phase is the **bag label**, defined as:

$$Y = \begin{cases} 1, & \text{if } \sum_{k=1}^{K} y_{k} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

A simple way to solve this problem is to assign the label of the bag to its instances, train an instance classifier, and finally aggregate the instance predictions into the bag label via average or max pooling. Another strategy is to learn a bag representation $F$, thereby reducing the problem to a conventional classification task. This strategy is more effective and can be seen as a form of MIL embedding learning. The **bag embedding** is defined as:

$$F = G(\{h_{k} \mid k = 1, 2, \ldots, K\}) \tag{5}$$

where $G$ is the aggregation function and $h_{k} \in \mathbb{R}^{D}$ is the extracted feature of instance $k$. A typical aggregation function is the attention mechanism:

$$F = \sum_{k=1}^{K} \alpha_{k} h_{k} \in \mathbb{R}^{D} \tag{6}$$

where $\alpha_{k}$ is the learned weight of instance $h_{k}$, and $D$ is the dimension of the vectors $F$ and $h_{k}$. This mechanism is shown in Figure 2(b). There are many ways to compute the attention scores; for example, classic AB-MIL computes the weights as:

$$\alpha_{k} = \frac{\exp\{w^{T}(\tanh(V_{1} h_{k}) \odot \mathrm{sigm}(V_{2} h_{k}))\}}{\sum_{j=1}^{K} \exp\{w^{T}(\tanh(V_{1} h_{j}) \odot \mathrm{sigm}(V_{2} h_{j}))\}} \tag{7}$$

where $w$, $V_{1}$, and $V_{2}$ are learnable parameters.
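The gated attention pooling of Eqs. 6–7 can be sketched as below; a minimal NumPy illustration, with function name and shapes chosen for clarity rather than taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_pool(h, w, V1, V2):
    """Gated attention pooling of classic AB-MIL (Eqs. 6-7).

    h  : (K, D) instance features.
    w  : (L,)   attention vector; V1, V2 : (L, D) projection matrices.
    Returns (F, alpha): bag embedding (D,) and attention weights (K,).
    """
    gate = np.tanh(h @ V1.T) * sigmoid(h @ V2.T)  # (K, L) gated projection
    logits = gate @ w                             # (K,) unnormalized scores
    logits -= logits.max()                        # numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum() # Eq. 7: softmax over instances
    F = alpha @ h                                 # Eq. 6: weighted sum
    return F, alpha
```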

## 2.2 Derivation of the instance probability in AB-MIL

Although the bag-embedding approach to MIL performs excellently, computing instance class probabilities under it has seemed infeasible. This paper shows that obtaining the prediction probability of a single instance in AB-MIL is in fact feasible (proof omitted here). Therefore, applying Grad-CAM to AB-MIL to directly infer the signal strength of an instance belonging to a certain class is feasible. Analogous to Eq. 2, the **signal strength** of instance $k$ belonging to class $c$ can be written as:

$$L_{k}^{c} = \sum_{d=1}^{D} \beta_{d} \hat{h}_{k,d}, \qquad \beta_{d} = \frac{1}{K} \sum_{i=1}^{K} \frac{\partial s_{c}}{\partial \hat{h}_{i,d}} \tag{8}$$

where $s_{c}$ is the output logit of the MIL classifier for class $c$, $\hat{h}_{k,d}$ is the $d$-th element of $\hat{h}_{k}$, and $\hat{h}_{k} = K \alpha_{k} h_{k}$. By applying the softmax function, the predicted probability that instance $k$ belongs to class $c$ is:

$$p_{k}^{c} = \frac{\exp(L_{k}^{c})}{\sum_{t=1}^{C} \exp(L_{k}^{t})} \tag{9}$$
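For a concrete feel for Eqs. 8–9, here is a small NumPy sketch under the simplifying assumption that the bag classifier is a single linear layer $s = W F$, so the gradient $\partial s_{c}/\partial \hat{h}_{i,d}$ is analytically $W_{c,d}/K$ (since $F = \frac{1}{K}\sum_{k}\hat{h}_{k}$); the helper name and shapes are illustrative only.

```python
import numpy as np

def instance_probs(h, alpha, W):
    """Per-instance class probabilities under AB-MIL (Eqs. 8-9),
    assuming a linear bag classifier s = W @ F.

    h     : (K, D) instance features.
    alpha : (K,)   attention weights (sum to 1).
    W     : (C, D) classifier weight matrix.
    Returns p : (K, C) instance class probabilities.
    """
    K = h.shape[0]
    h_hat = K * alpha[:, None] * h          # ĥ_k = K α_k h_k
    # Linear classifier => ∂s_c/∂ĥ_{i,d} = W[c,d]/K, so β_d = W[c,d]/K (Eq. 8)
    L = h_hat @ W.T / K                     # (K, C) signal strengths L_k^c
    L -= L.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(L) / np.exp(L).sum(axis=1, keepdims=True)  # Eq. 9
    return p
```

Note that under this assumption $L_{k}^{c}$ collapses to $\alpha_{k}\, W_{c}\cdot h_{k}$, i.e., the attention-scaled linear score of the instance.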

## 2.3 Double-tier feature distillation MIL

Given $N$ bags (WSIs), each bag has $K_{n}$ instances, i.e., $X_{n} = \{x_{n,k} \mid k = 1, 2, \ldots, K_{n}\}$, $n \in \{1, 2, \ldots, N\}$, and $Y_{n}$ denotes the true label of bag $n$. The feature of each instance is denoted $h_{n,k}$ and is extracted by a neural network $H$, i.e., $h_{n,k} = H(x_{n,k})$. The instances in each bag are randomly split into $M$ pseudo-bags with roughly equal numbers of instances, $X_{n} = \{X_{n}^{m} \mid m = 1, 2, \ldots, M\}$. Each pseudo-bag inherits the label of its parent bag, i.e., $Y_{n}^{m} = Y_{n}$. The Tier-1 AB-MIL model, denoted $T_{1}$, processes each pseudo-bag; the bag probability of each pseudo-bag under $T_{1}$ is:

$$\hat{y}_{n}^{m} = T_{1}(\{h_{k} = H(x_{k}) \mid x_{k} \in X_{n}^{m}\}) \tag{10}$$

The loss function of the $T_{1}$ tier is defined via cross-entropy:

$$\mathcal{L}_{1} = -\frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ Y_{n}^{m} \log \hat{y}_{n}^{m} + (1 - Y_{n}^{m}) \log(1 - \hat{y}_{n}^{m}) \right] \tag{11}$$

The probability of each instance in a pseudo-bag is then obtained via Eqs. 8–9. Based on the instance probabilities, a feature vector can be distilled from each pseudo-bag; the distillation result of the $m$-th pseudo-bag of the $n$-th bag is denoted $\hat{f}_{n}^{m}$. All distillation results are passed to the Tier-2 AB-MIL model $T_{2}$, whose output is the inference of each bag label:

$$\hat{y}_{n} = T_{2}(\{\hat{f}_{n}^{m} \mid m \in \{1, 2, \ldots, M\}\}) \tag{12}$$

The loss of $T_{2}$ is defined as:

$$\mathcal{L}_{2} = -\frac{1}{N} \sum_{n=1}^{N} \left[ Y_{n} \log \hat{y}_{n} + (1 - Y_{n}) \log(1 - \hat{y}_{n}) \right] \tag{13}$$

The **total loss** for classification is:

$$L = \underset{\theta_{1}}{\arg\min}\, \mathcal{L}_{1} + \underset{\theta_{2}}{\arg\min}\, \mathcal{L}_{2} \tag{14}$$

where $\theta_{1}$ and $\theta_{2}$ are the network parameters.
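The cross-entropy terms of Eqs. 11 and 13 are both the standard binary cross-entropy averaged over (pseudo-)bags; a minimal NumPy sketch, with an illustrative helper name:

```python
import numpy as np

def bag_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over all (pseudo-)bags (Eqs. 11, 13).

    y_true : (B,) bag labels in {0, 1}.
    y_pred : (B,) predicted bag probabilities in (0, 1).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```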

It should be noted that the pseudo-bags contain a large number of noisy labels: random splitting cannot guarantee that every pseudo-bag drawn from a positive bag contains at least one positive instance. Deep learning has some tolerance for noisy labels. Moreover, **the noise level roughly scales with $M$**; ablation experiments will later evaluate the impact of $M$ on the final performance.
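The random splitting step above can be sketched as follows; a minimal illustration assuming instance features have already been extracted (the function name is hypothetical, not the paper's code).

```python
import numpy as np

def split_pseudo_bags(features, M, rng):
    """Randomly split one bag's instance features into M pseudo-bags
    of roughly equal size; each pseudo-bag inherits the parent label.

    features : (K, D) instance features of a single WSI bag.
    Returns a list of M feature arrays.
    """
    idx = rng.permutation(features.shape[0])               # random instance order
    return [features[chunk] for chunk in np.array_split(idx, M)]
```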

Four feature distillation strategies are considered:

- **MaxS** (*maximum selection*): after $T_{1}$ processing, the feature of the instance with the maximum positive probability in the pseudo-bag is passed to $T_{2}$;
- **MaxMinS** (*maxmin selection*): the instances with the maximum and the minimum positive probability are both selected;
- **MAS** (*maximum attention score selection*): the instance with the largest attention score is selected;
- **AFS** (*aggregated feature selection*): the instance features are aggregated via Eq. 6.
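The four strategies can be sketched as below; a minimal illustration assuming the per-instance positive probabilities (Eqs. 8–9) and attention scores are already available (names and shapes are illustrative, not the paper's implementation).

```python
import numpy as np

def distill(h, p_pos, alpha, strategy):
    """Distill one feature vector from a pseudo-bag (Sec. 2.3).

    h     : (K, D) instance features of the pseudo-bag.
    p_pos : (K,)   instance positive probabilities (Eqs. 8-9).
    alpha : (K,)   Tier-1 attention scores.
    """
    if strategy == "MaxS":     # instance with max positive probability
        return h[np.argmax(p_pos)]
    if strategy == "MaxMinS":  # max- and min-probability instances, concatenated
        return np.concatenate([h[np.argmax(p_pos)], h[np.argmin(p_pos)]])
    if strategy == "MAS":      # instance with max attention score
        return h[np.argmax(alpha)]
    if strategy == "AFS":      # attention-weighted aggregate (Eq. 6)
        return alpha @ h
    raise ValueError(f"unknown strategy: {strategy}")
```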

Thanks for reading.