Active learning through two-stage clustering

Clustering-based active learning approaches exploit the structure of the data to select representative instances. However, existing algorithms are often inefficient or applicable only to certain types of data. In this paper, we propose an effective and adaptive algorithm called active learning through two-stage clustering (ALTA). The first stage is data preprocessing: a two-round clustering algorithm partitions the data into $\sqrt{n}$ small blocks, where $n$ is the number of instances, and for each block the instance closest to the block center is selected as its representative. The second stage is active learning over the representative instances through density clustering, and consists of a number of iterations of density clustering, labeling, and classification. Overall, the data preprocessing reduces the size of the data and the complexity of the algorithm, while the combination of distance-vector clustering and density clustering makes the algorithm more adaptive. Experiments compare ALTA against state-of-the-art active learning algorithms on nine datasets. The results demonstrate that the new algorithm achieves higher classification accuracy with the same number of labeled instances.
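The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify the clustering procedures in detail, so the following Python code uses k-means as a stand-in for the first-stage two-round clustering and a simple density-peaks-style ranking as a stand-in for the second-stage density clustering; the `oracle` labeling function and `budget` parameter are hypothetical placeholders, not part of the paper.

```python
# Minimal sketch of the two-stage workflow described in the abstract.
# Assumptions (not from the paper): k-means stands in for the two-round
# clustering of stage 1; a density-based ranking stands in for the density
# clustering of stage 2; `oracle` is a hypothetical labeling function.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


def stage_one_representatives(X):
    """Partition X into sqrt(n) blocks; keep the instance closest to each center."""
    n = len(X)
    k = max(1, int(np.sqrt(n)))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.array(reps)


def stage_two_active_learning(X, reps, oracle, budget):
    """Query the densest representatives for labels, then classify all instances."""
    R = X[reps]
    # Crude local density: number of representatives within a cutoff distance.
    d = np.linalg.norm(R[:, None] - R[None, :], axis=2)
    dc = np.percentile(d[d > 0], 2) if np.any(d > 0) else 1.0
    density = (d < dc).sum(axis=1)
    order = np.argsort(-density)                 # high-density representatives first
    queried = order[:budget]
    y_queried = np.array([oracle(reps[i]) for i in queried])
    clf = KNeighborsClassifier(n_neighbors=1).fit(R[queried], y_queried)
    return clf.predict(X)                        # predicted labels for all instances
```

In this sketch the first stage shrinks the pool from $n$ instances to roughly $\sqrt{n}$ representatives, and the second stage spends the labeling budget on the densest representatives before propagating labels to the remaining data; the actual ALTA algorithm interleaves density clustering, labeling, and classification over several iterations rather than in a single pass.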
