Learning Automata on Protein Sequences

Pattern discovery on proteins is generally limited to position-specific characterizations, such as Prosite patterns or profile HMMs, which cannot handle, for instance, dependencies between amino acids that are distant in the protein sequence but close in its three-dimensional structure. To overcome these limitations, we propose to learn automata on proteins. Inspired by grammatical inference and multiple alignment techniques, we introduce a sequence-driven approach based on two ideas: merging ordered partial local multiple alignments (PLMA) under preservation or consistency constraints, and identifying informative positions with respect to physico-chemical properties. The quality of the characterization is assessed experimentally on two difficult sets of proteins by comparison with the (semi-)manually designed patterns of Prosite and with state-of-the-art pattern discovery algorithms. Further leave-one-out experiments show that learning more precise automata improves accuracy by increasing the classification margins.
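To make the merging idea concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it assumes the local alignments are already given as ungapped blocks of equal-length fragments, it ignores the preservation and consistency constraints, and it does not generalize positions by physico-chemical groups. The names `build_automaton`, `accepts`, and the `(sequence index, start, length)` fragment encoding are hypothetical, chosen for illustration only. Each sequence starts as its own linear chain of states, and the states facing each other in an aligned block are then merged.

```python
from collections import defaultdict


def build_automaton(sequences, blocks):
    """Merge the states of aligned fragments (illustrative sketch only).

    sequences: list of strings.
    blocks: list of alignments; each alignment is a list of
            (sequence index, start, length) fragments of equal length.
    State (i, p) stands for "position p of sequence i, before reading
    the character at index p".
    """
    parent = {}

    def find(state):
        # Union-find lookup with path halving.
        parent.setdefault(state, state)
        while parent[state] != state:
            parent[state] = parent[parent[state]]
            state = parent[state]
        return state

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All sequences share one initial state and one final state.
    for i, seq in enumerate(sequences):
        union((0, 0), (i, 0))
        union((0, len(sequences[0])), (i, len(seq)))

    # Merge the states that face each other in every aligned block,
    # including the state reached after the last aligned character.
    for block in blocks:
        ref_seq, ref_start, length = block[0]
        for seq_index, start, _ in block[1:]:
            for offset in range(length + 1):
                union((ref_seq, ref_start + offset), (seq_index, start + offset))

    # Read the transitions of the merged (possibly non-deterministic) automaton.
    transitions = defaultdict(set)
    for i, seq in enumerate(sequences):
        for p, amino_acid in enumerate(seq):
            transitions[(find((i, p)), amino_acid)].add(find((i, p + 1)))

    return find((0, 0)), find((0, len(sequences[0]))), transitions


def accepts(automaton, word):
    """Simulate the non-deterministic automaton on a word."""
    start, final, transitions = automaton
    current = {start}
    for amino_acid in word:
        current = {t for s in current for t in transitions.get((s, amino_acid), ())}
        if not current:
            return False
    return final in current


# Toy example: the fragment "CD" is aligned between the two sequences.
toy_sequences = ["ACDEF", "GCDKF"]
toy_blocks = [[(0, 1, 2), (1, 1, 2)]]
toy_automaton = build_automaton(toy_sequences, toy_blocks)
print(accepts(toy_automaton, "ACDKF"))
print(accepts(toy_automaton, "ACKF"))
```

In this toy run the merged automaton accepts both training sequences and also recombinations such as "ACDKF", because the two chains now share the states around the aligned fragment. A word that skips the shared fragment, such as "ACKF", is rejected. Controlling this kind of generalization is what the preservation and consistency constraints mentioned above are meant to address.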
