KIS: An automated attribute induction method for classification of DNA sequences

Abstract This paper applies machine learning methods to the task of DNA sequence recognition. We present an algorithm that learns to recognize groups of DNA sequences sharing common features, such as sequence functionality. We apply the algorithm to splice site detection, i.e., to recognizing donor and acceptor sequences, and compare the results with those of reference methods that were designed and tuned specifically for detecting splice sites. We also show how to use the algorithm to build a human-readable model of the IRE (Iron-Responsive Element) and to find IRE sequences. Although the method is general-purpose, its results are comparable in quality to those obtained by the reference methods. In contrast to the reference methods, this approach uses models that operate on sequence patterns, which makes the results easier for humans to interpret.
