A Similar Fragments Merging Approach to Learn Automata on Proteins

We propose learning automata to characterize protein families, in order to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach that learns non-deterministic automata by selecting and ordering significantly similar fragments to be merged and by identifying physico-chemical properties. The quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation over a wide range of model specificities.
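
The leave-one-out protocol mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the learner here is a deliberately simple stand-in (the set of k-mers shared by all training sequences) rather than an automaton built by fragment merging, and the names `learn_shared_kmers`, `accepts`, and `leave_one_out_recall` as well as the toy sequences are assumptions made for illustration only.

```python
from typing import Callable, FrozenSet, List


def learn_shared_kmers(seqs: List[str], k: int = 3) -> FrozenSet[str]:
    """Hypothetical stand-in learner: keep the k-mers present in every training sequence."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    return frozenset(set.intersection(*kmer_sets)) if kmer_sets else frozenset()


def accepts(model: FrozenSet[str], seq: str) -> bool:
    """A sequence is accepted if the model is non-empty and all its k-mers occur in seq."""
    return bool(model) and all(kmer in seq for kmer in model)


def leave_one_out_recall(
    seqs: List[str],
    learn: Callable[[List[str]], FrozenSet[str]],
    accept: Callable[[FrozenSet[str], str], bool],
) -> float:
    """Fraction of sequences recognized by a model learned without them."""
    hits = 0
    for i, held_out in enumerate(seqs):
        training = seqs[:i] + seqs[i + 1:]  # every sequence except the held-out one
        model = learn(training)
        if accept(model, held_out):
            hits += 1
    return hits / len(seqs)


# Toy strings (not real protein sequences) that all share the fragment "NPA".
family = ["MKNPAWT", "GGNPALS", "TNPAQQ"]
recall = leave_one_out_recall(family, learn_shared_kmers, accepts)
```

Any learner and acceptance test with the same signatures could be substituted, so the same loop would apply to models of different specificity.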
