A Serial Sample Selection Framework for Active Learning

Active Learning is a machine learning and data mining technique that selects the most informative samples for labeling and uses them as training data. It aims to obtain a high performance classifier by labeling as little data as possible from large amount of unlabeled samples, which means sampling strategy is the core issue. Existing approaches either tend to ignore information in unlabeled data and are prone to querying outliers or noise samples, or calculate large amounts of non-informative samples leading to significant computation cost. In order to solve above problems, this paper proposed a serial active learning framework. It first measures uncertainty of unlabeled samples and selects the most uncertain sample set. From which, it further generates the most representative sample set based on the mutual information criterion. Finally, the framework selects the most informative sample from the most representative sample set based on expected error reduction strategy. Experimental results on multiple datasets show that our approach outperforms Random Sampling and the state of the art adaptive active learning method.

[1]  Fredrik Olsson,et al.  A literature survey of active machine learning in the context of natural language processing , 2009 .

[2]  Naonori Ueda,et al.  Adaptive semi-supervised learning on labeled and unlabeled data with different distributions , 2012, Knowledge and Information Systems.

[3]  Trevor Darrell,et al.  Active Learning with Gaussian Processes for Object Categorization , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  William J. Emery,et al.  Active Learning Methods for Remote Sensing Image Classification , 2009, IEEE Transactions on Geoscience and Remote Sensing.

[5]  Nikolaos Papanikolopoulos,et al.  Scalable Active Learning for Multiclass Image Classification , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Xin Li,et al.  Adaptive Active Learning for Image Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[8]  Mark Craven,et al.  An Analysis of Active Learning Strategies for Sequence Labeling Tasks , 2008, EMNLP.

[9]  Carl E. Rasmussen,et al.  Gaussian Processes for Machine Learning (GPML) Toolbox , 2010, J. Mach. Learn. Res..

[10]  Yolande Belaïd,et al.  A Stream-Based Semi-supervised Active Learning Approach for Document Classification , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[11]  Yehuda Koren,et al.  Advances in Collaborative Filtering , 2011, Recommender Systems Handbook.

[12]  Arnold W. M. Smeulders,et al.  Active learning using pre-clustering , 2004, ICML.

[13]  Masashi Sugiyama,et al.  Active Learning in Recommender Systems , 2011, Recommender Systems Handbook.

[14]  Jieping Ye,et al.  Querying discriminative and representative samples for batch mode active learning , 2013, KDD.

[15]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[16]  Lejla Batina,et al.  Mutual Information Analysis: a Comprehensive Study , 2011, Journal of Cryptology.

[17]  Bianca Zadrozny,et al.  Outlier detection by active learning , 2006, KDD '06.

[18]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  Bin Li,et al.  A survey on instance selection for active learning , 2012, Knowledge and Information Systems.