Active learning with direct query construction

Active learning may hold the key to solving the data scarcity problem in supervised learning, namely the lack of labeled data. Labeling data is a costly process, yet an active learner may request labels for only selected instances, dramatically reducing the labeling effort. Most previous work on active learning is, however, pool-based: a pool of unlabeled examples is given, and the learner can only select examples from that pool to query for their labels. This type of active learning has several weaknesses. In this paper we propose novel active learning algorithms that construct examples directly to query for labels. We study both a specific active learner based on the decision tree algorithm and a general active learner that can work with any base learning algorithm. Because there is no restriction on which examples may be queried, our methods are shown to often query fewer examples while reducing the predictive error quickly. This casts doubt on the usefulness of the pool in pool-based active learning. Nevertheless, our methods can easily be adapted to work with a given pool of unlabeled examples.
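The contrast between constructing queries and selecting them from a pool can be illustrated on a toy problem. The sketch below is not the paper's algorithm; it uses the classic 1-D threshold concept, where a learner free to construct its own queries can binary-search the decision boundary, while a pool-based learner is restricted to whatever unlabeled points happen to be available. All names (`oracle`, `learn_by_construction`, `learn_from_pool`) and the tolerance are illustrative assumptions.

```python
# Illustrative sketch only: direct query construction vs. pool-based
# selection when learning a 1-D threshold concept x >= t on [0, 1].

def oracle(x, true_threshold=0.37):
    """Labeling oracle: returns 1 if x >= threshold, else 0."""
    return int(x >= true_threshold)

def learn_by_construction(lo=0.0, hi=1.0, tol=1e-3):
    """Construct each query directly at the midpoint of the current
    uncertainty interval (binary search). Returns (estimate, #queries)."""
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid):
            hi = mid   # threshold lies at or below mid
        else:
            lo = mid   # threshold lies above mid
    return (lo + hi) / 2.0, queries

def learn_from_pool(pool, tol=1e-3):
    """Pool-based variant: may only query points in `pool`; greedily picks
    the pool point nearest the middle of the uncertainty interval."""
    lo, hi, queries = 0.0, 1.0, 0
    unlabeled = sorted(pool)
    while hi - lo > tol and unlabeled:
        mid = (lo + hi) / 2.0
        x = min(unlabeled, key=lambda p: abs(p - mid))
        unlabeled.remove(x)
        if not (lo < x < hi):
            continue   # x falls outside the uncertainty interval: no information gained
        queries += 1
        if oracle(x):
            hi = x
        else:
            lo = x
    return (lo + hi) / 2.0, queries

est, n = learn_by_construction()
print(f"constructed queries: threshold ~ {est:.4f} after {n} queries")
est_p, n_p = learn_from_pool([i / 20 for i in range(21)])
print(f"pool-based:          threshold ~ {est_p:.4f} after {n_p} queries")
```

With queries constructed freely, the uncertainty interval halves on every query, so accuracy is limited only by the query budget; the pool-based learner's final accuracy is capped by the granularity of the given pool, which mirrors the paper's point that the pool itself can be the bottleneck.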
