Associative Naïve Bayes classifier: Automated linking of Gene Ontology to MEDLINE documents

We demonstrate a text-mining method, the associative Naïve Bayes (ANB) classifier, for automatically linking MEDLINE documents to the Gene Ontology (GO). The approach is a nontrivial extension of document classification methodology from a fixed set of classes $C=\{c_1, c_2, \ldots, c_n\}$ to a knowledge hierarchy such as GO. Because of the complexity of GO, we use a knowledge representation structure. On top of that structure, we develop the ANB classifier, which automatically links MEDLINE documents to GO. To evaluate performance, we compare several well-known classifiers on our datasets: the NB classifier, the large Bayes classifier, the support vector machine, and the ANB classifier. Our results indicate the practical usefulness of the ANB classifier.
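
For orientation, the plain NB baseline named in the comparison can be sketched as follows. This is a minimal multinomial naive Bayes with Laplace smoothing over a fixed class set, not the ANB method itself. The function names, the toy documents, and the two class labels are hypothetical and are not taken from the paper or from GO.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Collect class and per-class word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior under naive Bayes."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for token in tokens:
            if token in vocab:  # tokens unseen in training are ignored
                # Laplace (add-one) smoothed word likelihood.
                count = word_counts[label][token]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus: two flat classes standing in for ontology terms.
TOY_DOCS = [
    ("kinase phosphorylation signaling".split(), "kinase_activity"),
    ("protein kinase atp binding".split(), "kinase_activity"),
    ("membrane transport ion channel".split(), "transport"),
    ("ion transport membrane potential".split(), "transport"),
]
model = train_nb(TOY_DOCS)
print(predict_nb(model, "kinase signaling".split()))
```

The sketch treats the classes as a flat set, which is exactly the setting the abstract describes extending to a hierarchy such as GO.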
