Paper reading (56): Multi-features prediction of protein translational modification sites (task)

Inge 2022-06-23 18:04:36

1 Introduction

1.1 Title

2017: Multi-features prediction of protein translational modification sites

1.2 Abstract

Post-translational modification (PTM) plays an important role in biological processes. A potential post-translational modification consists of a central site and its adjacent amino acid residues, which are basic protein sequence residues. It helps proteins exert their biological functions, and studying it helps to understand the molecular mechanisms that underlie protein design and drug design. Existing modification site prediction algorithms often suffer from problems such as low stability and low accuracy.
This paper combines the physical, chemical, statistical and biological properties of proteins and proposes a new framework for predicting protein post-translational modification sites. Multilayer neural networks and support vector machines are used to predict potential modification sites from selected features. These features include the composition of amino acid residues, the E-H description of protein fragments, and several properties from the AAindex database. Because the features may carry redundant information, feature selection is applied in the processing step. Experimental results show that the proposed method improves accuracy on this classification problem.
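The abstract does not say which feature selection procedure is used. As a minimal sketch of one common way to remove redundant features, the filter below drops any feature that is almost perfectly correlated with a feature already kept. The function names, the 0.95 threshold and the toy feature values are all assumptions made for illustration, not the paper's method.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if |r| with every previously kept feature is below `threshold`."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (one value per sample).
features = {
    "f1": [0.1, 0.4, 0.3, 0.9],
    "f2": [0.2, 0.8, 0.6, 1.8],  # exactly 2 * f1, so it adds no information
    "f3": [0.7, 0.1, 0.9, 0.2],
}
print(drop_redundant(features))  # ['f1', 'f3']
```

A filter like this is independent of the classifier, so the reduced feature set could be passed to either a neural network or a support vector machine.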

2 Method

2.1 Data sets

The function of a protein depends on its spatial conformation. The spatial structure of protein fragments may therefore help in analysing and identifying the characteristics of potential modification sites.
The experimental datasets are benchmark datasets in the PTM prediction field:
1) CPLM, a well-known database in the field of protein post-translational modification. It provides more than 2500 lysine succinylation sites as positive samples and 24000 non-succinylation sites as negative samples, drawn from 896 protein sequences. All of these protein fragments and polypeptide sequences come from UniProt, a well-known protein database in bioinformatics. It has been used in studies of enzyme specificity (ES) and protein-protein binding sites (PPB).
2) A dataset from a framework for predicting multiple K-PTM types of modification sites in protein sequences. It contains 6394 potential modification sites, each treated as a 27-tuple peptide sample. Of these, 1750 samples belong to none of the four K-PTM types, 3895 belong to one type, 740 belong to two types, 9 belong to three types, and none belong to all four.
3) Post-translational modification fragment datasets. Lysine acetylation site datasets for three species, Homo sapiens, Mus musculus (house mouse) and Saccharomyces cerevisiae, were collected from multiple sources, including PhosphoSite, UniProtKB/Swiss-Prot, UbiProt and SCUD, all well-known databases in proteomics. Because ubiquitin attaches to lysine residues of proteins to some extent, only lysine ubiquitination in these three species was considered in this work. The original dataset comprises 11547 protein sequences covering different species: more than 8000 from H. sapiens, about 3300 from M. musculus and more than 4500 from S. cerevisiae. After redundant protein fragments were removed for the three species, samples were extracted for each of them: 6323 H. sapiens samples, 2342 M. musculus samples and 7863 S. cerevisiae samples. Then 20 proteins were randomly selected from each of the three species datasets to form an independent test set, and the remaining 6303, 2322 and 7843 proteins were used to build the training set.
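Two of the steps above can be sketched in a few lines of Python: cutting 27-residue peptide samples from a sequence, and holding out 20 proteins per species for testing. This is only an illustration under stated assumptions: each sample is assumed to be centred on a lysine (K) with 13 residues on either side, sequence ends are padded with a placeholder `X`, and the function names, toy sequence and protein IDs are hypothetical.

```python
import random


def lysine_windows(sequence, flank=13, pad="X"):
    """Return (1-based position, window) for every lysine in `sequence`.

    Each window has 2 * flank + 1 residues with the lysine in the middle.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # The lysine sits at index i + flank of `padded`, i.e. mid-window.
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def hold_out_split(protein_ids, n_test=20, seed=0):
    """Randomly reserve `n_test` proteins for testing; the rest form the training set."""
    rng = random.Random(seed)
    test = set(rng.sample(protein_ids, n_test))
    train = [p for p in protein_ids if p not in test]
    return train, sorted(test)


toy = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
for position, window in lysine_windows(toy):
    print(position, window)

# The label groups of dataset 2 add up to the reported number of sites.
label_counts = {0: 1750, 1: 3895, 2: 740, 3: 9, 4: 0}
print(sum(label_counts.values()))  # 6394

# Sample counts per species as reported for dataset 3; the IDs are placeholders.
species_sizes = {"H.sapiens": 6323, "M.musculus": 2342, "S.cerevisiae": 7863}
for species, size in species_sizes.items():
    ids = [f"{species}_{i}" for i in range(size)]
    train, test = hold_out_split(ids)
    print(species, len(train), len(test))
```

Holding out 20 items from each species reproduces the training set sizes given in the text (6303, 2322 and 7843).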

2.2 Feature selection

Generally speaking, there can be more than 40000 kinds of protein features. They include the amino acid composition model (AAC), the pseudo amino acid composition model (PseAAC) and other information related to protein properties [26]. However, these features struggle to describe effectively and accurately the interaction between a predicted modification site and its adjacent amino acid residues. This paper therefore introduces typical, special features that are able to describe protein peptides.
First, when it comes to the composition of amino acid residues, many researchers in bioinformatics and computational biology use statistical information derived from protein sequences. Such features describe only the statistical aspects of a potential modification site, and selecting the key features within such a feature set is a difficult task.
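To make the idea of a composition feature concrete, here is a minimal sketch of the amino acid composition (AAC) of a peptide: a 20-dimensional vector holding the fraction of each standard residue. The function name `aac`, the residue ordering and the example peptide are assumptions for illustration; the paper's exact encoding is not given here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac(peptide):
    """Amino acid composition: fraction of each standard residue in `peptide`."""
    counts = Counter(peptide)
    # Padding or non-standard symbols are ignored when normalising.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


vector = aac("MKTAYIAKQRQISFVKSHFSRQ")  # hypothetical peptide
print(len(vector), round(sum(vector), 6))
```

Because the vector records only how often each residue occurs, it discards the order of residues around the central site, which is the limitation the text points out.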
The 20 amino acid residues have been found to show propensities for three classes of special structural elements: helix, strand and coil. These features are obtained from PSIPRED.
To capture the distribution of α-helices and β-strands effectively, each predicted protein fragment is represented by an E-H sequence description. The following table contains several features of the E-H description. Among these, both the basic features and the new features describe statistics of the E and H types. Because all of these features contain some redundant information and noise, only a subset is retained; the selected features are shown in the following table.
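As a rough illustration of what statistics over an E-H description might look like, the sketch below takes a three-state secondary-structure string (H for helix, E for strand, C for coil, the kind of per-residue labels PSIPRED predicts) and summarises its helix and strand content. The specific features computed here (fraction, number of segments, longest segment) and the example string are assumptions; the paper's own feature table is not reproduced in these notes.

```python
from itertools import groupby


def eh_features(ss):
    """Summarise helix (H) and strand (E) content of a secondary-structure string."""
    n = len(ss)
    # Collapse the string into runs, e.g. "HHHCC" -> [("H", 3), ("C", 2)].
    runs = [(state, len(list(group))) for state, group in groupby(ss)]
    features = {}
    for state in ("H", "E"):
        lengths = [length for s, length in runs if s == state]
        features[f"{state}_fraction"] = sum(lengths) / n if n else 0.0
        features[f"{state}_segments"] = len(lengths)
        features[f"{state}_longest"] = max(lengths, default=0)
    return features


# Hypothetical prediction for a 27-residue fragment.
print(eh_features("CCHHHHHHCCEEEECCCHHHCCEEECC"))
```

Each fragment then contributes a short fixed-length vector regardless of where the helices and strands fall inside it.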

The most popular and well-known amino acid index resource is AAindex, an online database of numerical indices that represent various biological, physical and chemical properties of amino acid residues, as well as other forms of protein sequence features. AAindex consists of three sections: AAindex1, AAindex2 and AAindex3 [27-29]. Several amino acid properties from it were used in this study.
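A property index of this kind can be applied by looking up each residue of a peptide in a table of 20 values. The sketch below uses the Kyte-Doolittle hydropathy scale, one of the indices collected in AAindex1, purely as a familiar example. Which properties the paper actually selected is not stated here, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(peptide, index=HYDROPATHY, unknown=0.0):
    """Replace each residue by its property value; unknown symbols map to `unknown`."""
    return [index.get(residue, unknown) for residue in peptide]


def mean_property(peptide, index=HYDROPATHY):
    """Average property value over the residues that appear in `index`."""
    values = [index[r] for r in peptide if r in index]
    return sum(values) / len(values) if values else 0.0


peptide = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical peptide
print(encode(peptide)[:5])
print(round(mean_property(peptide), 3))
```

The per-position encoding keeps the order of residues around the central site, while the mean gives a single summary value per property.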

