Paper reading (57): 2-hydr_Ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method (task)

Inge · 2022-06-23 18:04:50

1 Introduction

1.1 Title

2021: 2-hydr_Ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method

1.2 Summary

Lysine 2-hydroxyisobutyrylation is a newly detected type of post-translational modification in proteomics. Research on this modification may contribute to the study of a variety of diseases and to drug development. This work proposes 2-hydr_Ensemble, a new residue identification algorithm that exploits sequence information at the protein level, and compares it with typical classification models. The AUC values on HeLa cells, Spirococcus, rice seeds, and Saccharomyces cerevisiae reach 0.9197, 0.8192, 0.9307, and 0.8897, respectively. Two-profile Bayesian statistical features are further used to extract potential information from several feature vectors.

1.3 Bib

@article{Bao:2021:104351,
  author  = {Wen Zheng Bao and Bin Yang and Bai Tong Chen},
  title   = {2-hydr\_ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method},
  journal = {Chemometrics and Intelligent Laboratory Systems},
  volume  = {215},
  pages   = {104351},
  year    = {2021},
  doi     = {10.1016/j.chemolab.2021.104351}
}

2 Method

2.1 Modified residues

2.1.1 Correlation coefficient (CC)

The Pearson correlation coefficient is a linear correlation coefficient, generally used to measure the correlation between two variables of residues. For two gene sequences $X$ and $Y$, the Pearson correlation coefficient is calculated as:

$$R_{X,Y}=\frac{\sum{(X-\overline{X})(Y-\overline{Y})}}{\sqrt{\sum{(X-\overline{X})^2(Y-\overline{Y})^2}}}, \tag{1}$$

where $\overline{X}$ denotes the average value of $X$.
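As a quick illustration of Equation (1), here is a minimal NumPy sketch; the vectors `x` and `y` are hypothetical examples, not data from the paper:

```python
import numpy as np

def pearson_cc(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient of Equation (1)."""
    dx = x - x.mean()    # X - mean(X)
    dy = y - y.mean()    # Y - mean(Y)
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical expression vectors
y = np.array([1.1, 1.9, 3.2, 3.9])
print(pearson_cc(x, y))              # ~0.997: strong linear correlation
```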

2.1.2 Partial correlation coefficient (PCC)

The partial correlation coefficient is the correlation coefficient between two variables with the influence of other variables removed. Because the relationship between two variables can be complex and affected by multiple other variables, the PCC is often a better choice than the CC. The PCC is defined in terms of the corresponding CC. Let $R$ denote the CC matrix and $R^{-1}$ its inverse; then the PCC is calculated as:

$$R_{X,Y}'=\frac{R_{X,Y}^{-1}}{\sqrt{R_{X,X}^{-1}R_{Y,Y}^{-1}}}. \tag{2}$$
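Equation (2) can be sketched by inverting the empirical CC matrix; `data` is a hypothetical variables-by-samples matrix. Note that the conventional partial-correlation definition negates the off-diagonal entries, while this sketch follows the sign convention of Equation (2) as written:

```python
import numpy as np

def partial_cc(data: np.ndarray) -> np.ndarray:
    """Partial correlation matrix per Equation (2).

    data: (num_variables, num_samples); rows are variables.
    """
    R = np.corrcoef(data)                 # CC matrix R
    P = np.linalg.inv(R)                  # inverse matrix R^{-1}
    d = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    # Textbook PCC uses -P / d on the off-diagonal; Equation (2) omits the sign.
    return P / d

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 100))          # 5 hypothetical variables, 100 samples
print(partial_cc(data).round(3))
```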

2.1.3 Conditional mutual information (CMI)

Mutual information (MI) can measure the nonlinear correlation between modified and unmodified residues:

$$I(X,Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}, \tag{3}$$

where $p(x)$ is the probability of $x$ and $p(x,y)$ is the joint probability.
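For discrete (e.g., integer-coded) sequences, Equation (3) can be estimated from a joint histogram; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray) -> float:
    """Discrete MI of Equation (3), estimated from empirical frequencies."""
    bins = [np.arange(x.max() + 2), np.arange(y.max() + 2)]
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = pxy > 0                           # skip zero cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=2000)
print(mutual_information(x, x))                              # high: X with itself
print(mutual_information(x, rng.integers(0, 4, size=2000)))  # near 0: independent
```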
Both probabilities can be obtained by Gaussian kernel probability density estimation:

$$p\left(x_{i}\right)=\frac{1}{N} \sum_{j=1}^{N} \frac{1}{(2 \pi)^{n / 2} \sigma_{x}^{n / 2}} \exp \left(-\frac{1}{2}\left(X_{j}-X_{i}\right)^{T} C^{-1}\left(X_{j}-X_{i}\right)\right), \tag{4}$$

where $C$ denotes the covariance matrix of $X$, $\sigma_x$ denotes the standard deviation of $C$, and $n$ and $N$ denote the number of genes and the number of gene expression points, respectively. MI can therefore be calculated as:

$$I(X, Y)=\frac{1}{2} \log \left(\frac{\sigma_{X}^{2} \sigma_{Y}^{2}}{|C(X, Y)|}\right), \tag{5}$$

where $|C(X, Y)|$ is the determinant of the joint covariance matrix.
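Equation (5) reduces MI to the two variances and the determinant of the 2×2 joint covariance matrix; a minimal sketch under the same Gaussian assumption, with hypothetical data:

```python
import numpy as np

def gaussian_mi(x: np.ndarray, y: np.ndarray) -> float:
    """MI of Equation (5) under a Gaussian assumption."""
    cov = np.cov(x, y)                      # 2x2 joint covariance C(X, Y)
    det = np.linalg.det(cov)                # determinant |C(X, Y)|
    return 0.5 * np.log(cov[0, 0] * cov[1, 1] / det)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)    # correlated with x
print(gaussian_mi(x, y))                    # clearly above 0
```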
However, MI tends to overestimate dependence. Conditional mutual information (CMI) was therefore proposed:

$$CMI(X, Y \mid Z) = \sum_{x \in X, y \in Y, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}. \tag{6}$$

If two genes $X$ and $Y$ are unrelated given $Z$, then $CMI(X, Y \mid Z)=0$.
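A direct sketch of Equation (6) for discrete variables, using empirical counts and the identity $\frac{p(x,y\mid z)}{p(x\mid z)p(y\mid z)} = \frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)}$; all inputs are hypothetical:

```python
import numpy as np
from collections import Counter

def cmi(x, y, z) -> float:
    """Conditional mutual information CMI(X, Y | Z) of Equation (6)."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))   # counts for p(x, y, z)
    n_xz = Counter(zip(x, z))       # counts for p(x, z)
    n_yz = Counter(zip(y, z))       # counts for p(y, z)
    n_z = Counter(z)                # counts for p(z)
    total = 0.0
    for (a, b, c), k in n_xyz.items():
        # p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z))
        total += (k / n) * np.log(k * n_z[c] / (n_xz[(a, c)] * n_yz[(b, c)]))
    return total

rng = np.random.default_rng(5)
z = rng.integers(0, 2, size=5000)
x = (z + rng.integers(0, 2, size=5000)) % 2   # depends on z
y = (z + rng.integers(0, 2, size=5000)) % 2   # depends on z, not on x given z
print(cmi(x, y, z))                            # near 0, as expected
```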

2.1.4 Maximum information coefficient (MIC)

The maximum information coefficient (MIC) measures the linear or nonlinear relationship between two variables and requires no assumptions about the data distribution. Given a finite set $D$ whose elements are ordered pairs $(a, b)$, and a grid $G$ that partitions the data, the maximum information over all grids of size $a \times b$ is calculated as:

$$I^*(D,a,b)=\max I(D|_G), \tag{7}$$

where $I(D|_G)$ denotes the mutual information of $D$ restricted to grid $G$. The characteristic matrix $M(D)$ of $D$ is then calculated as:

$$M(D)_{a,b}=\frac{I^*(D,a,b)}{\log(\min(a,b))}. \tag{8}$$

$\max(M(D))$ is the MIC between the two genes; if they are unrelated, their MIC equals 0.
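Computing MIC exactly is impractical, so implementations search grids heuristically. One option is the minepy package, whose MINE class implements the approximate estimator from the original MIC paper; a sketch assuming minepy is installed, with hypothetical data:

```python
import numpy as np
from minepy import MINE   # assumes the minepy package is installed

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)   # nonlinear, near-zero linear correlation

mine = MINE(alpha=0.6, c=15)              # default parameters of the estimator
mine.compute_score(x, y)
print(mine.mic())                         # close to 1 despite low Pearson CC
print(np.corrcoef(x, y)[0, 1])            # near 0, for comparison
```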

2.2 Ensemble method

To improve the accuracy of detecting directly modified residues, a new dual ensemble method is proposed (a minimal sketch follows the list):
1) Given a gene data set $D$ containing $m$ genes and $n$ sample points, generate $K$ data sets $(D^1, D^2, \dots, D^K)$;
2) For each data set $D^i$, use CC, PCC, CMI, and MIC to directly compute the correlations between genes, obtaining four rank lists $(G_{CC}^i, G_{PCC}^i, G_{CMI}^i, G_{MIC}^i)$ that are merged into $G^i$;
3) Generate $(G^1, G^2, \dots, G^K)$;
4) Merge these into the final $G$.
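The notes do not spell out how the data sets are generated or how the rank lists are merged, so the sketch below assumes bootstrap resampling for step 1 and average-rank aggregation for the merges; the `measures` callables are stand-ins for CC, PCC, CMI, and MIC:

```python
import numpy as np

def rank_list(scores: np.ndarray) -> np.ndarray:
    """Rank gene pairs by descending |score| (rank 0 = strongest pair)."""
    return (-np.abs(scores).ravel()).argsort().argsort()

def dual_ensemble(data: np.ndarray, measures, K: int = 10, seed: int = 0):
    """Steps 1-4: resample K data sets, score each with every measure,
    and merge the resulting rank lists by average rank.

    data: (m genes, n samples); measures: callables mapping a data set
    to an (m, m) pairwise score matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = data.shape
    outer = []
    for _ in range(K):                                 # step 1: K data sets D^i
        D_i = data[:, rng.integers(0, n, size=n)]      # bootstrap over samples
        inner = [rank_list(f(D_i)) for f in measures]  # step 2: four rank lists
        outer.append(np.mean(inner, axis=0))           # ...merged into G^i
    return np.mean(outer, axis=0)                      # steps 3-4: final G

# Hypothetical usage with CC as the only measure:
G = dual_ensemble(np.random.default_rng(4).normal(size=(6, 40)),
                  measures=[np.corrcoef])
print(G.reshape(6, 6).round(1))
```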

