Local Substitutability for Sequence Generalization

Genomic banks are fed continuously by large sets of DNA or RNA sequences coming from high-throughput machines. Protein annotation is a task of primary importance for these banks. It consists of retrieving the genes that code for proteins within the sequences and then predicting the function of these new proteins in the cell by comparison with known families. Many methods have been designed to characterize protein families and discover new members, mainly based on subsets of regular expressions or simple Hidden Markov Models. We are interested in more expressive models that are able to capture the long-range characteristic interactions occurring in the spatial structure of the analyzed protein family. Starting from the work of Clark and Eyraud (2007) and Yoshinaka (2008) on inference of substitutable and k,l-substitutable languages respectively, we introduce new classes of substitutable languages using local rather than global substitutability, a reasonable assumption with respect to protein structures, to enhance the inductive leaps performed by least general generalization approaches. The concepts are illustrated in a first experiment using a real protein sequence set.
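To make the contrast between global and local substitutability concrete, here is a minimal sketch, not the inference algorithm of the paper. It assumes that a "local context" is simply the k symbols immediately to the left and the l symbols immediately to the right of a substring, and that two substrings are candidates for substitution when they share at least one such context in the sample. The function names `local_contexts` and `locally_substitutable_pairs` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def local_contexts(sample, k, l):
    """Map each (left, right) local context to the substrings observed inside it.

    Assumption: a local context is the k symbols just before and the
    l symbols just after a non-empty substring of a sample sequence.
    """
    table = defaultdict(set)
    for seq in sample:
        n = len(seq)
        for i in range(k, n - l + 1):          # substring start
            for j in range(i + 1, n - l + 1):  # substring end (exclusive)
                left = seq[i - k:i]
                right = seq[j:j + l]
                table[(left, right)].add(seq[i:j])
    return table


def locally_substitutable_pairs(sample, k, l):
    """Return pairs of distinct substrings that share at least one local context."""
    pairs = set()
    for substrings in local_contexts(sample, k, l).values():
        ordered = sorted(substrings)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs


if __name__ == "__main__":
    # Toy sequences over the amino-acid alphabet (made up for illustration).
    sample = ["AACDEFF", "GGCGEHH"]
    print(sorted(locally_substitutable_pairs(sample, 1, 1)))
    print(sorted(locally_substitutable_pairs(sample, 2, 2)))
```

In this toy sample, D and G never occur in the same full-sequence context, so a global criterion would not relate them, whereas with k = l = 1 they share the local context (C, E). Widening the window to k = l = 2 separates them again, which shows how the context lengths control the size of the inductive leap.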

[1] Damián López, et al. Protein Motif Prediction by Grammatical Inference, 2006, ICGI.

[2] Robert D. Finn, et al. InterPro in 2011: new developments in the family and domain prediction database, 2011, Nucleic acids research.

[3] Alexander Clark, et al. Polynomial Identification in the Limit of Substitutable Context-free Languages, 2005.

[4] Zellig S. Harris, et al. Distributional Structure, 1954.

[5] D. Higgins, et al. Finding flexible patterns in unaligned protein sequences, 1995, Protein science: a publication of the Protein Society.

[6] Enrique Vidal, et al. Learning Locally Testable Languages in the Strict Sense, 1990, ALT.

[7] Satoshi Kobayashi, et al. Learning local languages and its application to protein α-chain identification, 1994, Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[8] Ryo Yoshinaka, et al. Identification in the Limit of k,l-Substitutable Context-Free Languages, 2008, ICGI.

[9] Mikael Bodén, et al. MEME Suite: tools for motif discovery and searching, 2009, Nucleic Acids Res.

[10] Enrique Vidal, et al. Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[11] Jean-Christophe Nebel, et al. A stochastic context free grammar based framework for analysis of protein sequences, 2009, BMC Bioinformatics.

[12] Goulven Kerbellec, et al. Apprentissage d'automates modélisant des familles de séquences protéiques (Learning automata modelling families of protein sequences), 2008.

[13] Michael Y. Galperin, et al. From complete genome sequence to 'complete' understanding?, 2010, Trends in biotechnology.

[14] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[15] Francisco Casacuberta, et al. Local Languages, the Succesor Method, and a Step Towards a General Methodology for the Inference of Regular Grammars, 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, 1994, Nucleic acids research.

[17] Ewan Klein, et al. Natural Language Processing with Python, 2009.

[18] Amos Bairoch, et al. The PROSITE database, 2005, Nucleic Acids Res.

[19] François Coste, et al. A Similar Fragments Merging Approach to Learn Automata on Proteins, 2005, ECML.

[20] Pascal Caron. Families of locally testable languages, 2000, Theor. Comput. Sci.

[21] Eugene W. Myers, et al. Basic local alignment search tool, 1990, Journal of Molecular Biology.

[22] R. Chiodini, et al. The impact of next-generation sequencing on genomics, 2011, Journal of genetics and genomics = Yi chuan xue bao.

[23] Sean R. Eddy, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.

[24] Franco M. Luque, et al. PAC-Learning Unambiguous k,l-NTS≤ Languages, 2010, ICGI.

[25] Michael Elhadad. Book Review: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper, 2010, CL.