Semi-Supervised Active Learning for Modeling Medical Concepts from Free Text

We apply a new active learning formulation to the problem of learning medical concepts from unstructured text. The formulation is based on maximizing the mutual information that a sample labeling provides about the retrieval/classification model. This methodology extends the Query-by-Committee (QBC) approach (Seung et al., 1992) by exploiting unlabeled data in novel ways, beyond its common use solely as a pool of potential query points. Unlike QBC, our method employs unlabeled data alongside labeled data when selecting samples for labeling, so the chosen samples are both informative and relevant under a distribution of interest. This flexibility also allows us to tailor the model to arbitrary distributions relevant to the task at hand, in particular the distribution of the test data. The formulation has implications for scenarios where the training and test distributions differ, or where a general model is adapted into a more specific one. Experiments evaluate retrieval performance on natural-language text associated with various concepts of interest in the medical domain. We demonstrate the advantages of our formulation over QBC, the state-of-the-art active learning approach, and over random sample selection.
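To make the selection criterion concrete, the sketch below illustrates the general idea of density-weighted committee disagreement: score each unlabeled pool point by the mutual information between its label and a bootstrap committee, then weight that score by an estimated density of a distribution of interest (e.g., the test distribution). This is a minimal illustration under stated assumptions, not the paper's exact objective; the function names, the scikit-learn committee, and the KDE relevance weighting are all illustrative choices.

```python
# Minimal sketch (not the paper's exact formulation): committee-based
# mutual-information scoring of unlabeled pool points, reweighted by an
# estimated "distribution of interest". Names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity


def committee_probs(X_lab, y_lab, X_pool, n_members=5, seed=0):
    """Train a bootstrap committee; return per-member class probabilities.

    Assumes y_lab contains at least two classes.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))
        while len(np.unique(y_lab[idx])) < 2:      # resample degenerate bootstraps
            idx = rng.integers(0, len(X_lab), size=len(X_lab))
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        probs.append(clf.predict_proba(X_pool))
    return np.stack(probs)                          # shape: (members, pool, classes)


def select_query(X_lab, y_lab, X_pool, X_target):
    """Return the pool index with the highest density-weighted disagreement."""
    probs = committee_probs(X_lab, y_lab, X_pool)
    mean_p = probs.mean(axis=0)
    # Mutual information between the label and the committee member:
    # entropy of the mean prediction minus the mean per-member entropy.
    entropy_of_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)
    mean_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=2).mean(axis=0)
    info_gain = entropy_of_mean - mean_entropy
    # Relevance under the distribution of interest, estimated here with a
    # kernel density fit on unlabeled target-domain data (bandwidth is a
    # placeholder and would normally be tuned).
    kde = KernelDensity(bandwidth=1.0).fit(X_target)
    relevance = np.exp(kde.score_samples(X_pool))
    return int(np.argmax(info_gain * relevance))
```

In an active learning loop, the selected index would be sent to an annotator, its label added to the training set, and the committee retrained; with the relevance term set to a constant, the score reduces to plain committee disagreement, recovering QBC-style selection.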

[1] Thomas M. Cover et al. Elements of Information Theory, 2005.

[2] H. Sebastian Seung et al. Query by committee, COLT '92, 1992.

[3] Andrew McCallum et al. Toward Optimal Active Learning through Sampling Estimation of Error Reduction, ICML, 2001.

[4] Nicholas J. Belkin et al. A case for interaction: a study of interactive information retrieval behavior and effectiveness, CHI, 1996.

[5] Andrew McCallum et al. Employing EM and Pool-Based Active Learning for Text Classification, ICML, 1998.

[6] Brigham Anderson et al. Active learning for Hidden Markov Models: objective functions and algorithms, ICML, 2005.

[7] D. Lindley. On a Measure of the Information Provided by an Experiment, 1956.

[8] H. Sebastian Seung et al. Selective Sampling Using the Query by Committee Algorithm, Machine Learning, 1997.

[9] David J. C. MacKay et al. Information-Based Objective Functions for Active Data Selection, Neural Computation, 1992.

[10] H. Shimodaira et al. Improving predictive inference under covariate shift by weighting the log-likelihood function, 2000.

[11] David A. Cohn et al. Active Learning with Statistical Models, NIPS, 1996.

[12] William A. Gale et al. A sequential algorithm for training text classifiers, SIGIR '94, 1994.

[13] Jinbo Bi et al. Active learning via transductive experimental design, ICML, 2006.

[14] Masashi Sugiyama et al. Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error, J. Mach. Learn. Res., 2006.