# Paper reading (57): 2-hydr_Ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method

Inge, 2022-06-23 18:04:50

# 1 Introduction

## 1.1 Title

2021: Lysine 2-hydroxyisobutyrylation identification with ensemble method

## 1.2 Abstract

Lysine 2-hydroxyisobutyrylation is a new type of post-translational modification detected in proteomics. Research on this modification may contribute to the study of, and drug development for, a variety of diseases. This work proposes 2-hydr_Ensemble, a new algorithm that identifies modified residues from sequence information at the protein level. The method is compared with typical classification models. On HeLa cells, Spirococcus, rice seeds, and Saccharomyces cerevisiae, the AUC values reach 0.9197, 0.8192, 0.9307, and 0.8897, respectively. Bayes statistical features with two profiles are further used to uncover latent information in several feature vectors.

## 1.3 Bib

@article{Bao:2021:104351,
  author = {Wen Zheng Bao and Bin Yang and Bai Tong Chen},
  title = {2-hydr\_ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method},
  journal = {Chemometrics and Intelligent Laboratory Systems},
  volume = {215},
  pages = {104351},
  year = {2021},
  doi = {10.1016/j.chemolab.2021.104351}
}


# 2 Method

## 2.1 Modified residues

### 2.1.1 The correlation coefficient (CC)

The Pearson correlation coefficient is a linear correlation coefficient, generally used to measure the correlation between two variables of residues. For two gene sequences $X$ and $Y$, it is calculated as:
$$R_{X,Y}=\frac{\sum{(X-\overline{X})(Y-\overline{Y})}}{\sqrt{\sum{(X-\overline{X})^2}\sum{(Y-\overline{Y})^2}}}, \tag{1}$$
where $\overline{X}$ denotes the mean of $X$ and $\overline{Y}$ the mean of $Y$.
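As a minimal sketch of the correlation coefficient, the function below uses the standard Pearson form, in which the denominator is the square root of the product of the two sums of squares. The function name `pearson_cc` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Numerator: sum of products of the centered values.
    cov = sum(a * b for a, b in zip(dx, dy))
    # Denominator: square root of the product of the two sums of squares.
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / denom

# Hypothetical expression profiles.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_cc(x, [2.0 * v + 1.0 for v in x]))  # increasing linear relation
print(pearson_cc(x, [-v for v in x]))             # decreasing linear relation
```

The value lies in $[-1, 1]$; when ranking candidate associations, the absolute value is usually what matters.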

### 2.1.2 Partial correlation coefficient (PCC)

The partial correlation coefficient is the correlation between two variables with the influence of the other variables removed. Because the relationship between two variables can be complex and may be affected by several other variables, the PCC is often a better choice than the CC. The PCC is defined in terms of the corresponding CC. Let $R$ denote the CC matrix and $R^{-1}$ its inverse; the PCC is then calculated as:
$$R_{X,Y}'=\frac{R_{X,Y}^{-1}}{\sqrt{R_{X,X}^{-1}R_{Y,Y}^{-1}}}, \tag{2}$$
where $R_{X,Y}^{-1}$ denotes the $(X,Y)$ entry of $R^{-1}$.
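The following sketch applies Eq. (2) entrywise to a small CC matrix. It is dependency-free, so the helper `invert` (a plain Gauss-Jordan inversion) and the name `pcc_matrix` are assumptions made for illustration. Note that the conventional partial correlation is the negative of the off-diagonal values produced by Eq. (2) as written; the magnitude, which is what a ranking uses, is the same.

```python
import math

def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pcc_matrix(cc):
    """Apply Eq. (2) entrywise to a CC matrix (no sign flip, as in the text)."""
    inv = invert(cc)
    n = len(cc)
    return [[inv[i][j] / math.sqrt(inv[i][i] * inv[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical CC matrix for variables (X, Y, Z): X and Y are each correlated
# with Z at 0.8, and their mutual correlation 0.64 is fully explained by Z.
cc = [[1.0, 0.64, 0.8],
      [0.64, 1.0, 0.8],
      [0.8, 0.8, 1.0]]
for row in pcc_matrix(cc):
    print([round(v, 4) for v in row])
```

In this example the $(X, Y)$ entry is close to zero although the plain correlation is 0.64, which is the situation where the PCC separates direct from indirect association.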

### 2.1.3 Conditional mutual information (CMI)

Mutual information (MI) can measure the non-linear correlation between modified residues and unmodified residues:
$$I(X,Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}, \tag{3}$$
where $p(x)$ is the probability of $x$ and $p(x,y)$ is the joint probability. Both can be obtained by Gaussian kernel probability density estimation:
$$p\left(x_{i}\right)=\frac{1}{N} \sum_{j=1}^{N} \frac{1}{(2 \pi)^{n / 2} \sigma_{x}^{n / 2}} \exp \left(-\frac{1}{2}\left(X_{j}-X_{i}\right)^{T} C^{-1}\left(X_{j}-X_{i}\right)\right), \tag{4}$$
where $C$ denotes the covariance matrix of $X$, $\sigma_x$ denotes the standard deviation of $C$, and $n$ and $N$ denote the number of genes and the number of gene expression points, respectively. MI can therefore be calculated as:
$$I(X, Y)=\frac{1}{2} \log \left(\frac{\sigma_{X}^{2} \sigma_{Y}^{2}}{|C(X, Y)|}\right), \tag{5}$$
where $|C(X, Y)|$ is the determinant of the covariance matrix of $X$ and $Y$.
However, MI tends to overestimate the association. Conditional mutual information (CMI) was therefore proposed:
$$CMI(X, Y \mid Z) = \sum_{x \in X, y \in Y, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z) p(y \mid z)}. \tag{6}$$
If two genes $X$ and $Y$ are unrelated given $Z$, then $CMI(X, Y \mid Z)=0$.
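A short sketch of the covariance-based MI of Eq. (5) follows, together with a Gaussian closed form for CMI. The CMI expression $\frac{1}{2}\log\frac{|C(X,Z)|\,|C(Y,Z)|}{|C(Z)|\,|C(X,Y,Z)|}$ is not stated in the text above; it is the analogue of Eq. (5) under a Gaussian assumption and is used here as an assumption. All function names and the synthetic data are hypothetical.

```python
import math
import random

def covariance_matrix(columns):
    """Sample covariance matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]

def det(m):
    """Determinant by cofactor expansion (fine for the small matrices used here)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def gaussian_mi(x, y):
    """Eq. (5): 0.5 * log(var(X) * var(Y) / |C(X, Y)|), in nats."""
    c = covariance_matrix([x, y])
    return 0.5 * math.log(c[0][0] * c[1][1] / det(c))

def gaussian_cmi(x, y, z):
    """Assumed Gaussian CMI: 0.5 * log(|C(X,Z)| |C(Y,Z)| / (|C(Z)| |C(X,Y,Z)|))."""
    c_xz = det(covariance_matrix([x, z]))
    c_yz = det(covariance_matrix([y, z]))
    c_z = det(covariance_matrix([z]))
    c_xyz = det(covariance_matrix([x, y, z]))
    return 0.5 * math.log(c_xz * c_yz / (c_z * c_xyz))

# Hypothetical data: X and Y are both noisy copies of Z, so they are
# correlated overall but nearly independent once Z is known.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [v + rng.gauss(0, 0.5) for v in z]
y = [v + rng.gauss(0, 0.5) for v in z]
print("MI :", gaussian_mi(x, y))
print("CMI:", gaussian_cmi(x, y, z))
```

On data like this, MI stays clearly positive while CMI drops close to zero, which illustrates why conditioning on $Z$ reduces the overestimation of indirect associations.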

### 2.1.4 Maximum information coefficient (MIC)

The maximum information coefficient (MIC) measures the linear or nonlinear relationship between two variables without making assumptions about the distribution of the data. Given a two-variable data set $D$ whose elements are ordered pairs, let $G$ be a grid with $a$ columns and $b$ rows. The maximum mutual information over all grids of size $a \times b$ is:
$$I^*(D,a,b)=\max_{G} I(D|_G), \tag{7}$$
where $I(D|_G)$ denotes the mutual information of $D|_G$, the distribution induced by $D$ on the cells of $G$. The characteristic matrix $M(D)$ of $D$ is calculated as:
$$M(D)_{a,b}=\frac{I^*(D,a,b)}{\log(\min(a,b))}. \tag{8}$$
$\max(M(D))$ is the MIC between the two genes. If the two genes are not related, their MIC equals 0.
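The sketch below approximates Eqs. (7) and (8) in a simplified way: instead of searching over all grid placements, it only tries equal-frequency grids, so it gives a lower bound on the true MIC rather than the exact value. The grid-size budget $a \cdot b \le n^{0.6}$ is a commonly used default and is an assumption here, as are all function names.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (almost) equal numbers of points.
    Ties are split by original order, which is acceptable for a sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins

def grid_mi(bx, by):
    """Plug-in mutual information (in nats) of two discrete label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def approx_mic(x, y, exponent=0.6):
    """Largest normalized grid MI over a-by-b equal-frequency grids
    with a * b <= n ** exponent (simplified stand-in for Eqs. 7-8)."""
    n = len(x)
    budget = n ** exponent
    best = 0.0
    a = 2
    while a * 2 <= budget:
        bx = equal_frequency_bins(x, a)
        b = 2
        while a * b <= budget:
            by = equal_frequency_bins(y, b)
            # Eq. (8): normalize the grid MI by log(min(a, b)).
            best = max(best, grid_mi(bx, by) / math.log(min(a, b)))
            b += 1
        a += 1
    return best

# Hypothetical data: a parabola has almost no linear correlation with x,
# but the grid-based score still detects the dependence.
xs = [float(v) for v in range(200)]
print(approx_mic(xs, xs))
print(approx_mic(xs, [(v - 100.0) ** 2 for v in xs]))
```

For real analyses a full MIC implementation with grid optimization would be preferable; this version is only meant to make the normalization and the maximum over grid sizes concrete.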

## 2.2 Ensemble method

To improve the accuracy of detecting directly modified residues, a new dual ensemble method is proposed:
1) Given a gene data set $D$ containing $m$ genes and $n$ sample points, generate $K$ data sets $(D^1,D^2,\dots,D^K)$;
2) For each data set $D^i$, use CC, PCC, CMI, and MIC to compute the correlation between genes directly, obtain four rank lists $(G_{CC}^i,G_{PCC}^i,G_{CMI}^i,G_{MIC}^i)$, and integrate them into $G^i$;
3) Repeat for every data set to generate $(G^1,G^2,\dots,G^K)$;
4) Integrate these into the final result $G$.
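The four steps above can be sketched as follows. The text does not say how the $K$ data sets are generated or how rank lists are integrated, so this sketch assumes bootstrap resampling of the sample points and mean-rank aggregation; both are assumptions, not the paper's stated procedure. Only one pairwise measure (`abs_pearson`) is plugged in for brevity; the other measures would be passed in the same way as additional callables.

```python
import math
import random
from itertools import combinations

def abs_pearson(x, y):
    """|Pearson correlation|; stands in for any of the four pairwise measures."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def rank_pairs(scores):
    """Map each gene pair to its rank (0 = strongest association)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {pair: rank for rank, pair in enumerate(ordered)}

def mean_rank(rank_lists):
    """Integrate several rank lists by averaging the rank of each pair (assumed rule)."""
    return {pair: sum(r[pair] for r in rank_lists) / len(rank_lists)
            for pair in rank_lists[0]}

def dual_ensemble(data, measures, k=10, seed=0):
    """data: list of genes, each a list of n sample values.
    Returns gene pairs sorted from strongest to weakest consensus association."""
    rng = random.Random(seed)
    n = len(data[0])
    pairs = list(combinations(range(len(data)), 2))
    per_dataset = []
    for _ in range(k):
        # Step 1 (assumed): build D^i by bootstrap-resampling the sample points.
        idx = [rng.randrange(n) for _ in range(n)]
        sub = [[gene[i] for i in idx] for gene in data]
        # Step 2: one rank list per measure, integrated into G^i.
        lists = [rank_pairs({(a, b): f(sub[a], sub[b]) for a, b in pairs})
                 for f in measures]
        per_dataset.append(mean_rank(lists))
    # Steps 3-4: integrate G^1..G^K into the final ranking G.
    final = mean_rank(per_dataset)
    return sorted(final, key=final.get)

# Hypothetical data: gene 1 closely follows gene 0, gene 2 is independent.
rng = random.Random(1)
g0 = [rng.gauss(0, 1) for _ in range(100)]
g1 = [v + rng.gauss(0, 0.1) for v in g0]
g2 = [rng.gauss(0, 1) for _ in range(100)]
print(dual_ensemble([g0, g1, g2], [abs_pearson]))
```

The two levels of averaging mirror the "dual" structure: first across measures within one data set, then across data sets.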
