Research Paper: Optimal Training Sets for Bayesian Prediction of MeSH® Assignment

OBJECTIVES: The aim of this study was to improve naïve Bayes prediction of Medical Subject Headings (MeSH) assignment to documents using optimal training sets found by an active-learning-inspired method.

DESIGN: The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term, they found an optimal training set, a subset of the whole training set, consisting of all documents indexed with that MeSH term (the C₁ class) together with those documents not indexed with it (the C₋₁ class) that are closest to the C₁ class. These small sets were used to predict MeSH assignments in the MEDLINE database.

MEASUREMENTS: Average precision was used to compare MeSH assignment by the naïve Bayes learner trained on the whole training set, on the optimal sets, and on random sets. The authors compared the 95% lower confidence limits of the average precisions of naïve Bayes with upper bounds for the average precisions of a k-nearest-neighbor (KNN) classifier.

RESULTS: For all 20 MeSH terms, the optimal training sets produced nearly 200% improvement over use of the whole training sets. For 17 of the 20, naïve Bayes with optimal training sets was statistically better than KNN; for 15, the optimal training sets performed better than optimized feature selection. Overall, naïve Bayes averaged 14% better than KNN across all 20 MeSH assignments. Using the optimal sets with another classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes.

CONCLUSION: Using a smaller, optimal training set greatly improved learning with naïve Bayes, with performance superior to KNN. Such small training sets can also be used with more sophisticated learning methods, such as CMLS, where using the whole training set would not be feasible.
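The DESIGN paragraph describes the core construction: keep every document bearing the target MeSH term and only those non-bearing documents nearest to the positive class, then train on that reduced set. Below is a minimal sketch of the idea, assuming scikit-learn; the similarity measure (cosine similarity of TF-IDF vectors against the positive-class centroid), the function name optimal_training_set, and the parameter n_negatives are illustrative assumptions, since the abstract does not specify how "closest" is computed.

```python
# Sketch of the "optimal training set" idea from the abstract: keep all
# positive documents (C1 class) plus only the negatives (C-1 class) closest
# to the positive class, then train naive Bayes on that reduced set.
# ASSUMPTION: closeness is measured here as cosine similarity of TF-IDF
# vectors to the positive-class centroid; the paper's actual measure is
# not given in the abstract.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

def optimal_training_set(X, y, n_negatives):
    """Return row indices: all positives plus the n_negatives nearest negatives."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    centroid = np.asarray(X[pos].mean(axis=0))          # centroid of the C1 class
    sims = cosine_similarity(X[neg], centroid).ravel()  # closeness of each negative
    nearest_neg = neg[np.argsort(sims)[::-1][:n_negatives]]
    return np.concatenate([pos, nearest_neg])

# Toy usage; docs and labels stand in for MEDLINE citations and one MeSH term.
docs = ["gene expression in tumor cells", "heart rate variability study",
        "tumor suppressor gene analysis", "dietary survey of adults"]
labels = np.array([1, 0, 1, 0])  # 1 = MeSH term assigned, 0 = not assigned

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
idx = optimal_training_set(X, labels, n_negatives=1)
clf = MultinomialNB().fit(X[idx], labels[idx])
```

Whatever the exact similarity measure, the payoff described in the abstract is the same: the reduced set tempers the class imbalance and is small enough that per-term training can be repeated across many MeSH headings, or handed to a costlier learner such as CMLS.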
