RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences

In this paper, we present RBNBC, a repeat-based Naive Bayes classifier for bio-sequences that uses maximal frequent subsequences as features. RBNBC's design rests on generic ideas that apply to other domains in which the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that takes into account repeated occurrences of a subsequence within each sequence, rather than only whether the subsequence is present. Our extensive experiments on two collections of protein families show that RBNBC performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising because RBNBC is a generic classifier based purely on data mining and requires no domain-specific background knowledge. We note that incorporating domain-specific ideas could further improve its performance.
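
To make the idea of counting repeats concrete, the following is a minimal sketch and not the RBNBC formulation itself, which the abstract does not spell out. It assumes a standard multinomial-style Naive Bayes with Laplace smoothing as a stand-in, treats features as contiguous substrings, and takes the feature list as given instead of mining maximal frequent subsequences. The names `count_occurrences`, `train`, `classify`, and the parameter `alpha` are hypothetical.

```python
import math
from collections import defaultdict


def count_occurrences(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps.

    Assumption: a feature is a contiguous substring. Subsequences with
    gaps would need a different matching routine.
    """
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        start = sequence.find(pattern, start + 1)
    return count


def train(labeled_sequences, features, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class.

    labeled_sequences: iterable of (sequence, label) pairs.
    features: list of patterns, assumed to be mined beforehand.
    alpha: Laplace smoothing constant (hypothetical default).
    """
    class_sizes = defaultdict(int)
    repeat_counts = defaultdict(lambda: defaultdict(int))
    for sequence, label in labeled_sequences:
        class_sizes[label] += 1
        for feature in features:
            # Every repeat inside a sequence adds to the class count.
            repeat_counts[label][feature] += count_occurrences(sequence, feature)

    n_sequences = sum(class_sizes.values())
    model = {}
    for label, size in class_sizes.items():
        total = sum(repeat_counts[label][f] for f in features)
        denominator = total + alpha * len(features)
        log_likelihood = {
            f: math.log((repeat_counts[label][f] + alpha) / denominator)
            for f in features
        }
        model[label] = (math.log(size / n_sequences), log_likelihood)
    return model


def classify(model, sequence, features):
    """Return the class with the highest posterior log score."""
    counts = {f: count_occurrences(sequence, f) for f in features}

    def score(label):
        log_prior, log_likelihood = model[label]
        # A feature repeated k times contributes its log likelihood k times.
        return log_prior + sum(counts[f] * log_likelihood[f] for f in features)

    return max(model, key=score)


# Toy data, invented for illustration only.
training = [("ABABAB", "x"), ("ABAB", "x"), ("CDCDCD", "y"), ("CDCD", "y")]
features = ["AB", "CD"]
model = train(training, features)
predicted = classify(model, "ABABCD", features)
```

In this toy run, "AB" occurs twice and "CD" once in the query sequence, so the repeat counts tip the decision toward class "x". A classifier based only on presence or absence would see both features once and could not separate the two classes here.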
