Integrating AI with sequence analysis

This chapter discusses one example of how AI techniques are being integrated with, and are extending, existing molecular biology sequence analysis methods. AI ideas of complex representations, pattern recognition, search, and machine learning have been applied to the task of inferring and recognizing structural patterns associated with molecular function. We wish to construct such patterns, and to recognize them in unknown molecules, based on information inferred solely from protein primary (amino acid) sequences. Besides its intrinsic interest as a difficult machine learning task of induction from complex and noisy data, this is of interest in the empirical domain for:
