Amino acid encoding schemes for machine learning methods

In this paper, we investigate the efficiency of a number of commonly used amino acid encodings by using artificial neural networks and substitution scoring matrices. An important step in many machine learning techniques applied in computational biology is encoding the symbolic data of protein sequences efficiently as numeric vectors. This encoding can be based either on the physicochemical properties of the amino acids or on a generic numerical scheme. To be effective in a machine learning system, an encoding must preserve the information relevant to the problem at hand while discarding superfluous data. To this end, it is important to measure how well an encoding scheme conserves the underlying similarities and differences among the amino acids. One way to evaluate the effectiveness of an amino acid encoding scheme is to compare it with the roles that amino acids are actually found to play in biological systems. A numerical representation of the similarities and differences between amino acids can be found in the substitution matrices commonly used for sequence alignment, since these matrices are based on measures of the interchangeability of amino acids in biological specimens. In this study, we also propose a new encoding scheme based on the genetic codon coding that occurs during protein synthesis. The experimental results indicate that it performs better than the other commonly used encodings.
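To make the idea of a generic numerical encoding concrete, the following is a minimal sketch of an orthogonal (one-hot) encoding, in which each of the 20 standard amino acids is mapped to a 20-dimensional unit vector. It is an illustration only, not the encoding proposed in this paper; the names `AMINO_ACIDS`, `one_hot`, `encode_sequence`, and `euclidean` are chosen here for the example. The sketch also shows why such an encoding carries no similarity information: every pair of distinct residues is the same distance apart, whereas a substitution matrix treats some pairs as far more interchangeable than others.

```python
import math

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-dimensional unit vector for a single residue."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1.0  # raises ValueError for unknown symbols
    return vec


def encode_sequence(seq):
    """Encode a protein sequence as a list of one-hot vectors, one per residue."""
    return [one_hot(r) for r in seq]


def euclidean(u, v):
    """Euclidean distance between two encoded residues."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


encoded = encode_sequence("ACD")

# Leucine/isoleucine are chemically similar; leucine/aspartate are not.
# Under one-hot encoding both pairs are nevertheless equally far apart.
d_LI = euclidean(one_hot("L"), one_hot("I"))
d_LD = euclidean(one_hot("L"), one_hot("D"))
```

Here `d_LI` and `d_LD` are both equal to the square root of 2, so a learning system receiving this input has to discover any relationships between residues from the training data alone. An encoding derived from physicochemical properties or from a substitution matrix would instead place similar residues closer together.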
