Active learning using pre-clustering

The paper is concerned with two-class active learning. While the common approach to collecting labels in active learning is to select samples close to the classification boundary, better performance can be achieved by also taking the prior data distribution into account. The main contribution of the paper is a formal framework that incorporates clustering into active learning. The algorithm first constructs a classifier on the set of cluster representatives and then propagates the classification decision to the remaining samples via a local noise model. The proposed model makes it possible to select the most representative samples and to avoid repeatedly labeling samples in the same cluster. During the active learning process, the clustering is adjusted with a coarse-to-fine strategy to balance the advantage of large clusters against the accuracy of the data representation. Experiments on image databases show that our algorithm outperforms current methods.

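As a rough illustration of the pipeline described in the abstract, the following is a minimal sketch, not the authors' implementation: k-means stands in for the pre-clustering step, logistic regression for the classifier, a hard cluster assignment replaces the paper's local noise model, and the selection score (uncertainty weighted by cluster size) together with the coarse-to-fine schedule are simplified assumptions.

```python
# Hypothetical sketch of active learning with pre-clustering.
# All modeling choices below (k-means, logistic regression, the
# uncertainty-times-cluster-size score) are stand-ins, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def query_next(X, labeled_idx, y_labeled, n_clusters):
    """Return the index of the next sample to label."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

    # Classifier trained on the labeled samples; the paper instead fits the
    # model on the set of cluster representatives.
    clf = LogisticRegression().fit(X[labeled_idx], y_labeled)

    # Propagate the decision from each cluster representative to its members,
    # a crude version of the paper's local noise model.
    p_rep = clf.predict_proba(km.cluster_centers_)[:, 1]
    p_all = p_rep[km.labels_]

    # Closeness to the boundary, weighted by cluster size, so that large,
    # representative clusters are queried first and members of an already
    # covered cluster score low once its representative is confidently labeled.
    uncertainty = 1.0 - np.abs(2.0 * p_all - 1.0)
    size = np.bincount(km.labels_, minlength=n_clusters)[km.labels_]
    score = uncertainty * size / len(X)
    score[labeled_idx] = -np.inf  # never re-query a labeled sample
    return int(np.argmax(score))

# Toy driver with a coarse-to-fine schedule: few clusters at first, refined
# as labels accumulate (the schedule itself is an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
oracle = (X[:, 0] + X[:, 1] > 0).astype(int)                    # toy labeling oracle
labeled = [int(np.argmin(X.sum(1))), int(np.argmax(X.sum(1)))]  # one seed per class
for t in range(10):
    k = min(5 * (t + 1), 50)
    i = query_next(X, np.array(labeled), oracle[np.array(labeled)], k)
    labeled.append(i)
print("queried indices:", labeled[2:])
```

Re-clustering at every round with a growing number of clusters mirrors the coarse-to-fine adjustment: early queries exploit large, coarse clusters, while later rounds refine the partition for a more accurate data representation.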