Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment

As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used for both residue-level and sequence-level prediction of protein properties when combined with different algorithms. However, it has not attracted enough attention in the past decades, and no comprehensive review or assessment of encoding methods has been available so far. In this article, we present a systematic classification and a comprehensive review and assessment of various amino acid encoding methods. The methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance. The structure-based and machine-learning encoding methods also show some potential for further application; in particular, the neural-network-based distributed representation of amino acids may bring new light to this area. We hope that this review and assessment will be useful for future studies of amino acid encoding.
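To make the simplest of the five categories concrete, the following is a minimal sketch of binary encoding in its common one-hot form, where each residue becomes a 20-dimensional vector with a single 1. The alphabet ordering, the function name `one_hot_encode`, and the choice to map unknown residues to an all-zero vector are assumptions made for illustration, not details taken from the article.

```python
# Illustrative one-hot (binary) amino acid encoding.
# Assumption: the 20 standard residues, ordered alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a list of 20-dimensional 0/1 vectors, one per residue.

    Residues outside the 20-letter alphabet (e.g. 'X') are encoded as
    all-zero vectors; this handling is an assumption of this sketch.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        column = INDEX.get(residue)
        if column is not None:
            vector[column] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    matrix = one_hot_encode("MKV")
    for residue, row in zip("MKV", matrix):
        print(residue, row)
```

The result is an L x 20 matrix for a sequence of length L, which can be fed to a residue-level predictor directly or pooled for sequence-level tasks. Position-dependent encodings such as PSSM produce a matrix of the same shape but with real-valued, alignment-derived entries.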
