Representation of Protein-Sequence Information by Amino Acid Subalphabets

In computational biology, algorithms are designed to extract knowledge from biological data, in particular the data generated by large genome projects, which produce gene and protein sequences in high volume. In this article, we use machine learning strategies to explore new ways of representing protein-sequence information; the primary goal is to discover novel, powerful representations for use in AI techniques. Because proteins typically contain 20 different amino acids, a secondary goal is to discover how the current selection of amino acids now common in proteins might have emerged from simpler selections, or alphabets, used earlier in the evolution of living organisms.
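
To make the idea of a subalphabet concrete, the following minimal Python sketch rewrites a protein sequence over a smaller alphabet by merging amino acids into groups. The two-group split used here (labels "H" and "P"), the names GROUPS and reduce_sequence, and the example sequence are assumptions chosen purely for illustration; they are not the groupings or the method reported in the article.

```python
# Illustrative only: a hypothetical two-letter subalphabet, not the article's.
GROUPS = {
    "H": "AVLIMFWC",      # residues treated here as hydrophobic (assumption)
    "P": "GSTPNQYDEKRH",  # residues treated here as polar or charged (assumption)
}

# Invert the grouping into a lookup table: one-letter residue code -> group label.
RESIDUE_TO_GROUP = {
    residue: label for label, members in GROUPS.items() for residue in members
}


def reduce_sequence(sequence):
    """Rewrite a one-letter-code protein sequence over the reduced alphabet.

    Characters outside the 20 standard amino acids are mapped to "X".
    """
    return "".join(RESIDUE_TO_GROUP.get(residue, "X") for residue in sequence.upper())


if __name__ == "__main__":
    # Arbitrary example sequence, not taken from any dataset.
    print(reduce_sequence("MKVLAG"))
```

A reduced sequence of this kind could then be encoded as input to a learning algorithm with fewer symbols per position than the full 20-letter alphabet requires.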
