Paper reading (47): DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification

Inge 2022-06-23 18:03:22

0 Overview

0.1 Title

CVPR 2022: DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification

0.2 Background

Multiple instance learning (MIL) is increasingly used for the classification of histopathology whole slide images (WSIs). However, research in this setting still faces difficulties, such as small sample cohorts. Here the number of WSIs (bags) is limited, while the resolution of a single WSI is huge, which leads to a very large number of cropped patches (instances) per slide.
Tip: I am not sure "small sample cohort" is the most accurate rendering of the term. I have downloaded WSIs before, and a single slide can be larger than 1 GB, which is really scary.

0.3 Method

The paper introduces pseudo-bags to virtually increase the number of bags, and on this basis builds a double-tier MIL framework that makes effective use of the intrinsic features. In addition, the instance probability is derived under the attention-based MIL framework, and this derivation is used to help build and analyze the proposed framework.

0.4 Bib

author = {Hongrun Zhang and Yanda Meng and Yitian Zhao and Yihong Qiao and Xiaoyun Yang and Sarah E Coupland and Yalin Zheng},
title = {{DTFD-MIL}: {D}ouble-tier feature distillation multiple instance learning for histopathology whole slide image classification},
journal = {Computer Vision and Pattern Recognition},
year = {2022}

1 Introduction

Annotating whole slide images (WSIs) is one of the major challenges in computer vision. WSIs are widely used in histopathology, where digital pathology has improved pathologists' workflow and diagnostic decision-making and has also created demand for intelligent or automatic WSI analysis tools. A single WSI is very large, ranging from about 100 MB to 10 GB. Because of this, it is unrealistic to directly apply existing machine learning methods, such as models built for natural images or other medical images. Deep learning models also require large-scale data and high-quality annotations, but pixel-level labels for a WSI are basically out of reach ( ̄▽ ̄)". This lack of annotation has drawn great interest from deep learning researchers in weakly supervised and semi-supervised approaches, and most weakly supervised WSI studies can be characterized as MIL. Within the MIL framework, a WSI is treated as a bag that can contain thousands of patches (instances). As long as at least one instance is positive, the WSI is positive.
There have been many attempts at MIL problems in computer vision. However, the inherent nature of WSIs means that WSI classification under MIL is not as straightforward as in other computer vision subfields, because the only direct supervision for training is the labels of a few hundred WSIs. This can lead to over-fitting: the model tends to get trapped in a local minimum during optimization, and the learned features correlate only weakly with the target disease, which reduces the model's generalization ability.
To counter over-fitting, the guiding idea of MIL-based WSI research is to learn more information from fewer labels. Exploiting the mutual-instance relation is one effective approach. The relation can be specified as a spatial or feature distance, or learned by neural modules such as recurrent neural networks, transformers, and graph neural networks.
Most existing methods can be classified as attention-based MIL (AB-MIL); they differ mainly in how the attention scores are computed. However, explicitly inferring instance probabilities under the AB-MIL framework has been considered infeasible, so attention scores are often used instead as an indicator of positive activation. In this paper, we argue that the attention score is not a rigorous measure for this purpose, and instead derive the instance probability under the AB-MIL framework.
Given an oversized WSI, the units that are processed directly are the small patches cropped from it. An MIL model for WSIs aims to identify the most distinctive patches, because they are most likely to trigger the bag label. However, there are few WSIs and countless patches, and label information is only available at the WSI level. Moreover, in a pathology WSI the positive instances corresponding to the lesion area often occupy only a small part of the tissue, which makes the proportion of positive instances very small. Identifying these positive instances therefore remains challenging, precisely in a setting where over-fitting is most likely.

In recent years, many methods have used mutual-instance information to improve MIL performance, but they do not explicitly address the problems caused by the inherent characteristics of WSIs described above. To alleviate the negative effects of these problems, we introduce the concept of pseudo-bags into the framework: the instances of a WSI are randomly split, and each part forms a pseudo-bag. Each pseudo-bag is assigned the label of its parent bag, that is, the original bag. This increases the number of bags and ensures that each pseudo-bag contains fewer instances, which is the main idea of our double-tier feature distillation MIL model, as shown in Figure 1. Specifically, a Tier-1 AB-MIL model is applied to all pseudo-bags of all WSIs. One risk, however, is that a pseudo-bag drawn from a positive bag may contain no positive instance at all, in which case it is assigned a wrong label.

Figure 1: How the proposed method differs from traditional MIL

To address this problem, we distill a feature vector from each pseudo-bag and build a Tier-2 AB-MIL model on these vectors, as shown in Figure 3. Through this distillation, the Tier-1 model provides well-defined features from which the Tier-2 model can obtain a better representation of the parent bag. In addition, for feature distillation we use the basic idea of Grad-CAM (gradient-based class activation map), a technique for visualizing deep learning features, to derive the instance probability within the AB-MIL framework.

Figure 3: Overall framework of DTFD-MIL. A set of instances is first cropped from the tissue region of the WSI (only nine are shown here). These instances are then divided into M pseudo-bags (for example, 3). The Tier-1 AB-MIL model produces a feature vector for each pseudo-bag, and these vectors are the input of the Tier-2 AB-MIL model. The true label of the bag supervises the predictions of both tiers.

In essence, we approach the WSI problem from a new perspective, namely a double-tier MIL framework. The main contributions are as follows:
1) We introduce the concept of pseudo-bags to cope with the shortage of WSIs;
2) Using the basic idea of Grad-CAM, we derive the instance probability directly from the AB-MIL perspective, which could serve as an extension to many MIL methods in the future;
3) Based on the derived instance probability, we develop a double-tier MIL framework and demonstrate its advantages on two large public WSI datasets.

2 Method

2.1 Review of Grad-CAM and AB-MIL

2.1.1 Grad-CAM

An end-to-end deep learning image classification model usually consists of two modules: a deep convolutional neural network (DCNN) for high-level feature extraction and a multi-layer perceptron (MLP) for classification. Feeding an image to the DCNN yields multiple feature maps, and a pooling function turns them into a feature vector. Passing this feature vector to the MLP gives the class probabilities 🤩, as shown in Figure 2 (a).

Figure 2: (a) Illustration of a deep learning image classification model. Global average pooling is applied to the feature maps of the whole image to obtain a feature vector, which is passed to the MLP to get the class probabilities. (b) Illustration of AB-MIL. The extracted features of the instances are weighted by their attention scores, and the weighted average over all instances serves as the bag representation, which is passed to the MLP to output the bag prediction.

Suppose the feature maps output by the DCNN are $U\in\mathbb{R}^{D\times W\times H}$, where $D$ is the number of channels and $W$ and $H$ are the spatial dimensions. Applying global average pooling to $U$ gives the feature vector representing the image:
$$\boldsymbol{f}=\text{GAP}_{W,H}(U)\in\mathbb{R}^D \tag{1}$$
where $\text{GAP}_{W,H}(U)$ is average pooling over $W,H$, that is, the $d$-th element of $\boldsymbol{f}$ is $f_d=\frac{1}{WH}\sum_{w=1,h=1}^{W,H}U_{w,h}^d$. With $\boldsymbol{f}$ as input, the MLP outputs the logit $s^c$ for class $c\in\{1,2,\dots,C\}$, which indicates the signal strength of the image belonging to class $c$; the predicted class probabilities are obtained by a softmax. Following Grad-CAM, the class activation map for class $c$ is defined as a weighted sum of the feature maps:
$$\boldsymbol{L}^c=\sum_{d}^D\beta_d^cU^d,\qquad\beta_d^c=\frac{1}{WH}\sum_{w,h}^{W,H}\frac{\partial s^c}{\partial U_{w,h}^d} \tag{2}$$
where $\boldsymbol{L}^c\in\mathbb{R}^{W\times H}$, and $L_{w,h}^c$, the magnitude of $\boldsymbol{L}^c$ at position $(w,h)$, indicates how strongly this position points to class $c$:
$$L_{w,h}^c=\sum_{d=1}^D\beta_d^cU_{w,h}^d \tag{3}$$
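As a concrete illustration of Eqs. (1)-(3), here is a minimal NumPy sketch. It assumes, purely for illustration, that the MLP is a single linear layer (`W_cls` and `b_cls` are hypothetical names), so the gradient of the logit with respect to the feature maps has a closed form and no autodiff library is needed. A real model would obtain the gradient by back-propagation.

```python
import numpy as np

def grad_cam_linear(U, W_cls, b_cls, c):
    """Class activation map for class c, following Eqs. (1)-(3).

    Assumption (not from the paper): the classifier is one linear layer,
    s = W_cls @ f + b_cls, so d s^c / d U[d, w, h] = W_cls[c, d] / (W * H).
    """
    D, W, H = U.shape
    f = U.mean(axis=(1, 2))                 # Eq. (1): global average pooling, shape (D,)
    s = W_cls @ f + b_cls                   # class logits, shape (C,)
    # Closed-form gradient of s^c with respect to every entry of U.
    grad = np.broadcast_to(W_cls[c][:, None, None] / (W * H), U.shape)
    beta = grad.mean(axis=(1, 2))           # Eq. (2): average gradient per channel, shape (D,)
    L = np.tensordot(beta, U, axes=(0, 0))  # Eq. (3): weighted sum of feature maps, shape (W, H)
    return f, s, L

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 4, 5))              # D=8 channels, 4x5 spatial grid (toy sizes)
W_cls = rng.normal(size=(3, 8))             # C=3 classes
b_cls = np.zeros(3)
f, s, L = grad_cam_linear(U, W_cls, b_cls, c=1)
print(L.shape)  # (4, 5)
```

Under the linear-classifier assumption the entries of the map sum to the logit minus the bias, which is a handy sanity check when experimenting.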

2.1.2 AB-MIL

Given a bag of $K$ instances $X=\{x_1,x_2,\dots,x_K\}$, each instance $x_k$, $k\in\{1,2,\dots,K\}$, has a hidden (unknown) label $y_k$, where $y_k=1$ means positive and $y_k=0$ negative. The goal of MIL is to detect whether the bag contains at least one positive instance. The only supervision available during training is the bag label, defined as:
$$Y=\left\{ \begin{array}{ll} 1,&\qquad \text{if}\ \sum_{k=1}^Ky_k>0\\ 0,&\qquad\text{otherwise} \end{array} \right. \tag{4}$$
A simple way to solve this problem is to assign each instance the label of its bag, train an instance classifier, and then aggregate the instance predictions into a bag label by average or max pooling. Another strategy is to learn a bag representation $\boldsymbol{F}$, which reduces the problem to a conventional classification task. This strategy is more effective and can be seen as a form of embedding-based MIL. The bag embedding is defined as:
$$\boldsymbol{F}=\text{G}(\{\boldsymbol{h}_k \mid k=1,2,\dots,K\}) \tag{5}$$
where $\text{G}$ is the aggregation function and $\boldsymbol{h}_k\in\mathbb{R}^D$ is the feature extracted from instance $k$. A typical aggregation function is the attention mechanism:
$$\boldsymbol{F}=\sum_{k=1}^K\alpha_k\boldsymbol{h}_k\in\mathbb{R}^D \tag{6}$$
where $\alpha_k$ is the learnable weight of $\boldsymbol{h}_k$ and $D$ is the dimension of the vectors $\boldsymbol{F}$ and $\boldsymbol{h}_k$. This mechanism is shown in Figure 2 (b). There are many ways to compute attention scores; for example, the classic AB-MIL weights are computed as:
$$\alpha_k=\frac{\exp\{ \boldsymbol{w}^T(\tanh (\boldsymbol{V}_1\boldsymbol{h}_k) \odot\text{sigm}(\boldsymbol{V}_2\boldsymbol{h}_k)) \}}{\sum_{j=1}^K\exp\{ \boldsymbol{w}^T(\tanh (\boldsymbol{V}_1\boldsymbol{h}_j) \odot\text{sigm}(\boldsymbol{V}_2\boldsymbol{h}_j)) \}} \tag{7}$$
where $\boldsymbol{w}$, $\boldsymbol{V}_1$, and $\boldsymbol{V}_2$ are learnable parameters.
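The gated attention pooling of Eqs. (6)-(7) can be sketched in a few lines of NumPy. The function name, the hidden size of the attention branch, and the random parameters below are assumptions made for illustration; in a trained model `V1`, `V2`, and `w` would be learned.

```python
import numpy as np

def gated_attention_pool(h, V1, V2, w):
    """Attention pooling over K instance features, Eqs. (6)-(7).

    h: (K, D) instance features; V1, V2: (hidden, D); w: (hidden,).
    Returns the attention scores alpha (K,) and the bag embedding F (D,).
    """
    tanh_branch = np.tanh(h @ V1.T)                    # (K, hidden)
    sigm_branch = 1.0 / (1.0 + np.exp(-(h @ V2.T)))    # (K, hidden)
    scores = (tanh_branch * sigm_branch) @ w           # w^T(tanh(.) * sigm(.)), shape (K,)
    scores = scores - scores.max()                     # shift for a numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()      # Eq. (7)
    F = alpha @ h                                      # Eq. (6): sum_k alpha_k h_k
    return alpha, F

rng = np.random.default_rng(0)
K, D, hidden = 10, 16, 8                               # toy sizes
h = rng.normal(size=(K, D))
alpha, F = gated_attention_pool(
    h, rng.normal(size=(hidden, D)), rng.normal(size=(hidden, D)), rng.normal(size=hidden)
)
```

Because the scores pass through a softmax, they are positive and sum to one over the bag, so `F` is a convex combination of the instance features.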

2.2 Derivation of instance probability in AB-MIL

Although bag-embedding MIL performs very well, computing instance-level class probabilities has seemed infeasible. The paper proves that the predicted probability of a single instance can be obtained in AB-MIL (the proof is omitted here). It is therefore feasible to apply Grad-CAM to AB-MIL and directly infer the signal strength of an instance belonging to a given class. Similar to Eq. (2), the signal strength of instance $k$ for class $c$ can be written as:
$$L_k^c=\sum_{d=1}^D\beta_d^c\hat{h}_{k,d},\qquad\beta_{d}^c=\frac{1}{K}\sum_{i=1}^K\frac{\partial s^c}{\partial\hat{h}_{i,d}} \tag{8}$$
where $s^c$ is the output logit of the MIL classifier for class $c$, $\hat{h}_{k,d}$ is the $d$-th element of $\hat{\boldsymbol{h}}_k$, and $\hat{\boldsymbol{h}}_k=\alpha_kK\boldsymbol{h}_k$. Applying a softmax, the predicted probability that instance $k$ belongs to class $c$ is:
$$p_k^c=\frac{\exp(L_k^c)}{\sum_{t=1}^C\exp(L_k^t)} \tag{9}$$
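To make Eqs. (8)-(9) tangible, the sketch below computes instance probabilities under one simplifying assumption that is not taken from the paper: the bag classifier is a single linear layer with weights `W_cls` (a hypothetical name). Since the bag embedding is the mean of the rescaled features, the gradient of each logit with respect to any rescaled feature is then `W_cls[c, d] / K`, so no autodiff is needed.

```python
import numpy as np

def instance_probabilities(h, alpha, W_cls):
    """Per-instance class signal L[k, c] and probability p[k, c], Eqs. (8)-(9).

    Assumption (illustration only): the bag classifier is linear,
    s = W_cls @ F + b with F = sum_k alpha_k h_k = mean_k h_hat_k,
    hence d s^c / d h_hat[i, d] = W_cls[c, d] / K for every instance i.
    """
    K = h.shape[0]
    h_hat = alpha[:, None] * K * h                  # h_hat_k = alpha_k * K * h_k, shape (K, D)
    beta = W_cls / K                                # beta[c, d]: mean over i of a constant gradient
    L = h_hat @ beta.T                              # Eq. (8), shape (K, C)
    shifted = L - L.max(axis=1, keepdims=True)      # shift for a numerically stable softmax
    p = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)  # Eq. (9)
    return L, p

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))                         # K=6 instances, D=4 features (toy sizes)
alpha = rng.random(6)
alpha /= alpha.sum()                                # stand-in for attention scores
W_cls = rng.normal(size=(2, 4))                     # C=2 classes
L, p = instance_probabilities(h, alpha, W_cls)
```

In this linear case the per-instance signals add up to the bag logit (minus the bias), which shows how the bag-level decision is split across instances.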

2.3 Double-tier feature distillation MIL

Given $N$ bags (WSIs), where bag $n$ contains $K_n$ instances, $\boldsymbol{X}_n=\{ x_{n,k} \mid k=1,2,\dots,K_n\}$, $n\in\{ 1,2,\dots,N \}$, let $Y_n$ denote the true label of the bag. The feature of each instance is denoted $\boldsymbol{h}_{n,k}$ and is extracted by a neural network $\mathbf{H}$, that is, $\boldsymbol{h}_{n,k}=\mathbf{H}(x_{n,k})$. The instances in each bag are randomly divided into $M$ pseudo-bags of roughly equal size, $\boldsymbol{X}_n=\{ \boldsymbol{X}_n^m \mid m = 1,2,\dots,M \}$. Each pseudo-bag takes the label of its parent bag, $Y_n^m=Y_n$. The Tier-1 AB-MIL model, denoted $\text{T}_1$, processes each pseudo-bag, and the bag probability it estimates for a pseudo-bag is:
$$y_n^m=\text{T}_1(\{ \boldsymbol{h}_k = \mathbf{H}(x_k) \mid x_k\in\boldsymbol{X}_n^m \}) \tag{10}$$
The loss function of $\text{T}_1$ is the cross-entropy:
$$\mathcal{L}_1=-\frac{1}{MN}\sum_{n=1,m=1}^{N,M}\left[Y_n^m\log y_n^m+(1-Y_n^m)\log(1-y_n^m)\right] \tag{11}$$
The probability of each instance in a pseudo-bag is then obtained with Eqs. (8)-(9). Based on these instance probabilities, a feature vector is distilled from each pseudo-bag; the distilled feature of the $m$-th pseudo-bag of the $n$-th bag is denoted $\hat{\boldsymbol{f}}_n^m$. All distilled features are passed to the Tier-2 AB-MIL model $\text{T}_2$, which infers the label of each bag:
$$\hat{y}_n=\text{T}_2\left( \left\{ \hat{\boldsymbol{f}}_n^m \mid m \in (1,2,\dots,M) \right\} \right) \tag{12}$$
The loss of $\text{T}_2$ is defined as:
$$\mathcal{L}_2=-\frac{1}{N}\sum_{n=1}^N\left[Y_n\log\hat{y}_n+(1-Y_n)\log(1-\hat{y}_n)\right] \tag{13}$$
The overall optimization objective is:
$$\mathcal{L}=\arg\min_{\boldsymbol{\theta}_1}\mathcal{L}_1+\arg\min_{\boldsymbol{\theta}_2}\mathcal{L}_2 \tag{14}$$
where $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ are the network parameters.
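The two losses of Eqs. (11) and (13) are both binary cross-entropy terms, written here in the standard form with a leading minus sign. The sketch below evaluates them for given predictions; the function names and the clipping constant `eps` are assumptions added for illustration and numerical safety.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def tier_losses(Y, y_pseudo, y_bag):
    """Tier-1 and Tier-2 losses.

    Y: (N,) true bag labels; y_pseudo: (N, M) Tier-1 pseudo-bag probabilities;
    y_bag: (N,) Tier-2 bag probabilities.
    """
    M = y_pseudo.shape[1]
    Y_pseudo = np.repeat(Y[:, None], M, axis=1)   # every pseudo-bag inherits its parent label
    loss1 = bce(Y_pseudo, y_pseudo)               # Eq. (11): average over all N*M pseudo-bags
    loss2 = bce(Y, y_bag)                         # Eq. (13): average over the N bags
    return loss1, loss2

Y = np.array([1.0, 0.0])
loss1, loss2 = tier_losses(Y, np.full((2, 3), 0.5), np.full(2, 0.5))
```

With every prediction at 0.5, both losses equal log 2, the value expected from an uninformative binary classifier.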
Note that the pseudo-bags carry a considerable amount of label noise, because random splitting does not guarantee that every pseudo-bag from a positive bag contains at least one positive instance. Deep learning has some tolerance to noisy labels. In addition, the noise level is roughly tied to $M$, and ablation experiments are used later to evaluate the effect of $M$ on the final performance.
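The random split and the label noise it causes can be simulated with a short sketch. The helper names are hypothetical, and the instance labels used in the second function are only available in a simulation, since real WSI instance labels are unknown.

```python
import numpy as np

def split_into_pseudo_bags(num_instances, M, rng):
    """Randomly partition indices 0..num_instances-1 into M pseudo-bags of near-equal size."""
    perm = rng.permutation(num_instances)
    return np.array_split(perm, M)

def noisy_pseudo_bag_fraction(instance_labels, pseudo_bags):
    """Fraction of pseudo-bags (from a positive bag) that contain no positive instance."""
    has_no_positive = [instance_labels[idx].sum() == 0 for idx in pseudo_bags]
    return float(np.mean(has_no_positive))

rng = np.random.default_rng(0)
labels = np.zeros(1000, dtype=int)
labels[[10, 500]] = 1                          # a positive bag with only two positive instances
for M in (2, 5, 10):
    bags = split_into_pseudo_bags(len(labels), M, rng)
    print(M, noisy_pseudo_bag_fraction(labels, bags))
```

With only two positive instances, at most two pseudo-bags can be correctly labelled, so the fraction of wrongly labelled pseudo-bags grows as `M` increases. This is the trade-off that the ablation on $M$ examines.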
Four feature distillation strategies are considered:
MaxS (maximum selection): after processing by $\text{T}_1$, the feature of the instance with the maximum positive probability in the pseudo-bag is passed to $\text{T}_2$;
MaxMinS (MaxMin selection): the features of two instances, those with the maximum and the minimum positive probability, are passed to $\text{T}_2$;
MAS (maximum attention score selection): the feature of the instance with the largest attention score is selected;
AFS (aggregated feature selection): the aggregated feature obtained with Eq. (6) is used.
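The four strategies can be compared side by side in a small sketch. The function below is illustrative rather than the authors' implementation; in particular, MaxMinS is read here as selecting the instances with the maximum and minimum positive probability, which is what its name suggests.

```python
import numpy as np

def distill(h, pos_prob, alpha, strategy):
    """Return the distilled feature(s) of one pseudo-bag as an (n, D) array.

    h: (K, D) instance features; pos_prob: (K,) positive-class probability
    of each instance (Eq. 9); alpha: (K,) attention scores (Eq. 7).
    """
    if strategy == "MaxS":       # instance with the highest positive probability
        return h[[np.argmax(pos_prob)]]
    if strategy == "MaxMinS":    # assumed reading: highest and lowest positive probability
        return h[[np.argmax(pos_prob), np.argmin(pos_prob)]]
    if strategy == "MAS":        # instance with the highest attention score
        return h[[np.argmax(alpha)]]
    if strategy == "AFS":        # attention-weighted aggregate, Eq. (6)
        return (alpha @ h)[None, :]
    raise ValueError(f"unknown strategy: {strategy}")

h = np.arange(12.0).reshape(4, 3)             # K=4 instances, D=3 features (toy values)
pos_prob = np.array([0.10, 0.90, 0.30, 0.05])
alpha = np.array([0.4, 0.1, 0.3, 0.2])
for name in ("MaxS", "MaxMinS", "MAS", "AFS"):
    print(name, distill(h, pos_prob, alpha, name))
```

Returning a 2-D array for every strategy keeps the Tier-2 input format uniform: each pseudo-bag contributes one row, or two rows for MaxMinS.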
