
# 1 Introduction

## 1.1 Subject

## 1.2 Summary

Lysine 2-hydroxyisobutyrylation is a recently discovered type of post-translational modification detected in proteomics. Research on this modification may contribute to the study of, and drug development for, a variety of diseases. This work proposes a new residue identification algorithm, **2-hydr_Ensemble**, which uses protein-level sequence information about the residues. The method is compared with typical classification models. The results show that the AUC values for HeLa cells, Spirococcus, rice seeds, and Saccharomyces cerevisiae reach 0.9197, 0.8192, 0.9307, and 0.8897, respectively. Bi-profile Bayes statistical features are further used to uncover potential information from several feature vectors.

## 1.3 Bib

```
@article{Bao:2021:104351,
  author  = {Wen Zheng Bao and Bin Yang and Bai Tong Chen},
  title   = {2-hydr\_ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method},
  journal = {Chemometrics and Intelligent Laboratory Systems},
  volume  = {215},
  pages   = {104351},
  year    = {2021},
  doi     = {10.1016/j.chemolab.2021.104351}
}
```

# 2 Method

## 2.1 Modified residues

### 2.1.1 The correlation coefficient (CC)

The Pearson correlation coefficient is a linear correlation coefficient, generally used to measure the correlation between two residue variables. For two gene sequences $X$ and $Y$, the **Pearson correlation coefficient** is calculated as follows:

$$R_{X,Y}=\frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^{2}\sum (Y-\bar{Y})^{2}}},\tag{1}$$

where $\bar{X}$ and $\bar{Y}$ denote the mean values of $X$ and $Y$.
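As a minimal sketch of Eq. (1), the coefficient can be computed directly from the deviations of each sequence from its mean. The function name `pearson_cc` and the sample values are illustrative only and do not come from the paper.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Eq. 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

# Illustrative values: y is roughly 2 * x, so r should be close to 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_cc(x, y)
```

A value near $1$ or $-1$ indicates a strong linear relationship, while a value near $0$ indicates no linear relationship (a nonlinear one may still exist).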

### 2.1.2 Partial correlation coefficient (PCC)

The partial correlation coefficient is the correlation between two variables with the influence of the other variables removed. Because the relationship between two variables can be complex and may be affected by several other variables, the partial correlation coefficient is often a better choice than CC. PCC is defined in terms of the corresponding CC. Let $R$ denote the CC matrix and $R^{-1}$ its inverse; PCC is then calculated as:

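$$PCC_{X,Y}=-\frac{(R^{-1})_{X,Y}}{\sqrt{(R^{-1})_{X,X}\,(R^{-1})_{Y,Y}}},\tag{2}$$

where $(R^{-1})_{X,Y}$ denotes the entry of the inverse CC matrix for the pair $(X,Y)$.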
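The following sketch illustrates how partial correlations can be read off the inverse of the CC matrix. It uses the standard definition (negated off-diagonal entry of $R^{-1}$ divided by the square root of the two matching diagonal entries), which is assumed to be what Eq. (2) intends. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def partial_correlation_matrix(data):
    """Partial correlations from the inverse of the correlation matrix.

    data: array of shape (n_samples, n_variables).
    Raises numpy.linalg.LinAlgError if the correlation matrix is singular.
    """
    corr = np.corrcoef(data, rowvar=False)   # CC matrix R
    precision = np.linalg.inv(corr)          # R^{-1}
    d = np.sqrt(np.diag(precision))
    pcc = -precision / np.outer(d, d)        # standard sign convention
    np.fill_diagonal(pcc, 1.0)
    return pcc

# Synthetic example (assumption): x and y are both driven by z,
# so they are correlated only indirectly.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z])

cc = np.corrcoef(data, rowvar=False)
pcc = partial_correlation_matrix(data)
# cc[0, 1] is large, while pcc[0, 1] is close to zero once z is accounted for.
```

This shows why PCC can be preferable to CC: the plain correlation between `x` and `y` is high, but it largely disappears after removing the influence of `z`.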

### 2.1.3 Conditional mutual information (CMI)

Mutual information (MI) can measure the nonlinear correlation between modified and unmodified residues:

$$I(X,Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)},\tag{3}$$

where $p(x)$ is the probability of $x$ and $p(x,y)$ is the joint probability. Both can be obtained by Gaussian kernel probability density estimation:

$$p(x_{i})=\frac{1}{N}\sum_{j=1}^{N}\frac{1}{(2\pi)^{n/2}\sigma_{x}}\exp\left(-\frac{1}{2}(X_{j}-X_{i})^{T}C^{-1}(X_{j}-X_{i})\right),\tag{4}$$

where $C$ denotes the covariance matrix of $X$, $\sigma_{x}$ denotes the standard deviation corresponding to $C$, and $n$ and $N$ denote the number of genes and the number of gene expression points, respectively. MI can therefore be calculated as:

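$$I(X,Y)=\frac{1}{2}\log\left(\frac{\sigma_{X}^{2}\,\sigma_{Y}^{2}}{|C(X,Y)|}\right),\tag{5}$$

where $\sigma_{X}^{2}$ and $\sigma_{Y}^{2}$ are the variances of $X$ and $Y$, and $|C(X,Y)|$ is the determinant of their covariance matrix.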
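A minimal sketch of the closed-form MI in Eq. (5), assuming the two variables are jointly Gaussian: MI is half the log of the product of the two variances divided by the determinant of the $2\times 2$ covariance matrix. The function name `gaussian_mi` and the synthetic samples are illustrative assumptions.

```python
import numpy as np

def gaussian_mi(x, y):
    """MI (in nats) of two 1-D samples under a joint-Gaussian assumption."""
    cov = np.cov(x, y)                # 2x2 covariance matrix C(X, Y)
    det = np.linalg.det(cov)          # |C(X, Y)|
    # Product of the two variances divided by the determinant.
    return 0.5 * np.log(cov[0, 0] * cov[1, 1] / det)

# Synthetic data (assumption): y_dep depends on x, y_ind does not.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y_dep = x + rng.normal(size=2000)
y_ind = rng.normal(size=2000)

mi_dep = gaussian_mi(x, y_dep)
mi_ind = gaussian_mi(x, y_ind)
```

Under this assumption the result equals $-\tfrac{1}{2}\log(1-r^{2})$, where $r$ is the Pearson coefficient, so the Gaussian form captures only the linear part of the dependence.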

However, MI tends to overestimate the dependence between variables. Therefore, **conditional mutual information** (CMI) was proposed:

$$CMI(X,Y\mid Z)=\sum_{x\in X,\,y\in Y,\,z\in Z}p(x,y,z)\log\frac{p(x,y\mid z)}{p(x\mid z)\,p(y\mid z)}.\tag{6}$$

If two genes $X$ and $Y$ are unrelated given $Z$, then $CMI(X,Y\mid Z)=0$.
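The sketch below evaluates CMI under a joint-Gaussian assumption, using the closed form $\tfrac{1}{2}\log\frac{|C(X,Z)|\,|C(Y,Z)|}{|C(Z)|\,|C(X,Y,Z)|}$ for a single conditioning variable. This closed form is not stated in the text above and is used here as an assumption in the same spirit as Eq. (5). The function name and data are hypothetical.

```python
import numpy as np

def gaussian_cmi(x, y, z):
    """CMI(X, Y | Z) in nats for 1-D samples, assuming joint Gaussianity."""
    c = np.cov(np.vstack([x, y, z]))                    # 3x3 covariance matrix
    det_xz = np.linalg.det(c[np.ix_([0, 2], [0, 2])])   # |C(X, Z)|
    det_yz = np.linalg.det(c[np.ix_([1, 2], [1, 2])])   # |C(Y, Z)|
    var_z = c[2, 2]                                     # |C(Z)| for scalar Z
    det_xyz = np.linalg.det(c)                          # |C(X, Y, Z)|
    return 0.5 * np.log(det_xz * det_yz / (var_z * det_xyz))

# Synthetic data (assumption): x and y share the common driver z (indirect link),
# while w is generated directly from x (direct link).
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)
w = x + 0.5 * rng.normal(size=2000)

cmi_indirect = gaussian_cmi(x, y, z)   # close to zero
cmi_direct = gaussian_cmi(x, w, z)     # clearly positive
```

Conditioning on `z` removes the apparent dependence between `x` and `y` but leaves the direct dependence between `x` and `w`, which is the behaviour the text attributes to CMI.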

### 2.1.4 Maximum information coefficient (MIC)

The maximum information coefficient (MIC) measures the linear or nonlinear relationship between two variables and makes no assumptions about the distribution of the data. Let $D$ be a two-variable data set whose elements are ordered pairs, and let $G$ be a grid that partitions the data into $a$ columns and $b$ rows. The maximum mutual information over all grids of size $a\times b$ is calculated as:

$$I^{*}(D,a,b)=\max_{G} I(D|_{G}),\tag{7}$$

where $I(D|_{G})$ denotes the mutual information of $D|_{G}$. $M(D)$ is the characteristic matrix of $D$, calculated as:

$$M(D)_{a,b}=\frac{I^{*}(D,a,b)}{\log(\min(a,b))}.\tag{8}$$

$\max(M(D))$ is the MIC between the two genes whose paired values form $D$. If the two genes are unrelated, their MIC equals 0.
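The following is a simplified approximation of Eqs. (7) and (8), not the full MIC procedure. Instead of maximizing over every possible grid of size $a\times b$, it tries only equal-frequency grids, so it gives a lower bound on the true value. The grid-size limit $a\cdot b\le N^{0.6}$ and all function names are assumptions introduced for illustration.

```python
import numpy as np

def _equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins of (nearly) equal count."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(values)

def _grid_mi(bx, by, a, b):
    """Mutual information (nats) of the a-by-b contingency table."""
    counts = np.zeros((a, b))
    np.add.at(counts, (bx, by), 1)
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0                                   # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def approx_mic(x, y, alpha=0.6):
    """Largest normalized grid MI over equal-frequency grids with a*b <= n**alpha."""
    x, y = np.asarray(x), np.asarray(y)
    max_cells = len(x) ** alpha
    best = 0.0
    for a in range(2, int(max_cells // 2) + 1):
        bx = _equal_frequency_bins(x, a)
        for b in range(2, int(max_cells // a) + 1):
            by = _equal_frequency_bins(y, b)
            score = _grid_mi(bx, by, a, b) / np.log(min(a, b))   # Eq. (8)
            best = max(best, score)
    return best

# Illustrative data: a parabola has almost no linear correlation but a strong
# nonlinear relationship; independent noise has neither.
x = np.linspace(-1.0, 1.0, 200)
rng = np.random.default_rng(0)
mic_linear = approx_mic(x, 2.0 * x + 1.0)
mic_parabola = approx_mic(x, x ** 2)
mic_noise = approx_mic(x, rng.normal(size=200))
```

For practical use, a dedicated MIC implementation that searches grid placements more thoroughly would be preferable; this sketch only shows how the normalization in Eq. (8) lets linear and nonlinear relationships be scored on the same scale.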

## 2.2 Ensemble method

To improve the accuracy of detecting directly modified residues, a new dual ensemble method is proposed:

1) Given a gene data set $D$ containing $m$ genes and $n$ sample points, generate $K$ data sets $(D_{1},D_{2},…,D_{K})$;

2) For each data set $D_{i}$, CC, PCC, CMI, and MIC are used to directly compute the correlations between genes, giving four ranked lists $(G_{CC},G_{PCC},G_{CMI},G_{MIC})$, which are integrated into $G_{i}$;

3) Repeat step 2 for every data set to generate $(G_{1},G_{2},…,G_{K})$;

4) Integrate these into the final result $G$.
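The four steps above can be sketched as follows. The text does not say how the $K$ data sets are generated or how the ranked lists are integrated, so this sketch assumes bootstrap resampling of the sample points and integration by averaging ranks. For brevity only two scorers (CC and PCC) are included; CMI and MIC scorers would be added to the list in the same way. All names are hypothetical.

```python
import numpy as np

def abs_cc(data):
    """Absolute Pearson correlation matrix; data has shape (n_samples, m_genes)."""
    return np.abs(np.corrcoef(data, rowvar=False))

def abs_pcc(data):
    """Absolute partial correlation matrix from the inverse correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    return np.abs(precision / np.outer(d, d))

def rank_pairs(score, pairs):
    """Rank gene pairs by score; rank 0 is the strongest pair."""
    values = np.array([score[i, j] for i, j in pairs])
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(pairs))
    ranks[order] = np.arange(len(pairs))
    return ranks

def dual_ensemble(data, scorers, k=10, seed=0):
    """Return gene pairs ordered from strongest to weakest consensus rank."""
    rng = np.random.default_rng(seed)
    n, m = data.shape
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    per_dataset = []
    for _ in range(k):
        sample = data[rng.integers(0, n, size=n)]            # step 1: D_i (bootstrap, assumed)
        ranks = [rank_pairs(f(sample), pairs) for f in scorers]
        per_dataset.append(np.mean(ranks, axis=0))           # step 2: G_i (mean rank, assumed)
    final = np.mean(per_dataset, axis=0)                     # steps 3-4: combine G_1..G_K into G
    return [pairs[idx] for idx in np.argsort(final, kind="stable")]

# Synthetic data (assumption): gene 1 depends directly on gene 0;
# genes 2 and 3 are independent of everything else.
rng = np.random.default_rng(1)
g0 = rng.normal(size=300)
g1 = g0 + 0.5 * rng.normal(size=300)
data = np.column_stack([g0, g1, rng.normal(size=300), rng.normal(size=300)])
ranking = dual_ensemble(data, [abs_cc, abs_pcc], k=10)
```

The two levels of averaging (across measures within each $D_{i}$, then across the $K$ data sets) are what make the scheme a dual ensemble. Other aggregation rules, such as median rank, would fit the same structure.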
