Design and analysis of the WCCI 2010 active learning challenge

We organized a data mining challenge on “active learning” for IJCNN/WCCI 2010, addressing machine learning problems in which labeling data is expensive but large amounts of unlabeled data are available at low cost. Examples include handwriting and speech recognition, document classification, vision tasks, and drug design using recombinant molecules and protein engineering. Such problems can be tackled from two angles: learning from unlabeled data or active learning. In the former case, the algorithms must make do with the limited amount of labeled data and capitalize on the unlabeled data using semi-supervised learning methods; several past challenges have addressed this problem. In the latter case, the algorithms may place a limited number of queries to obtain labels for new samples. The goal is then to optimize the queries, and the problem is referred to as active learning. While active learning is of great importance, organizing a challenge in that area is non-trivial. This is the problem we have addressed, and we describe our approach in this paper. The “active learning” challenge is part of the WCCI 2010 competition program (http://www.wcci2010.org/competition-program). The challenge website remains open for submission of new methods beyond the termination of the challenge, as a resource for students and researchers (http://clopinet.com/al).
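To make the query-based setting concrete, the following is a minimal sketch of pool-based active learning with uncertainty sampling, one common strategy in this setting. The synthetic data, the tiny logistic-regression learner, and the query budget are all illustrative assumptions, not part of the challenge protocol itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary pool: two Gaussian blobs (illustrative data, not challenge data).
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

def fit_logreg(X, y, lr=0.1, steps=500):
    """Minimal logistic regression trained by gradient descent (numpy only)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Start from a few labeled seeds; the rest forms the unlabeled pool.
labeled = list(rng.choice(n, size=4, replace=False))
pool = [i for i in range(n) if i not in labeled]

for _ in range(10):  # query budget: 10 additional labels
    w = fit_logreg(X[labeled], y[labeled])
    p = predict_proba(w, X[pool])
    # Uncertainty sampling: query the pool point with probability closest to 0.5.
    q = pool[int(np.argmin(np.abs(p - 0.5)))]
    labeled.append(q)  # the "oracle" reveals y[q]
    pool.remove(q)

w = fit_logreg(X[labeled], y[labeled])
acc = float(np.mean((predict_proba(w, X) > 0.5) == y))
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```

Each round spends one unit of the labeling budget on the example the current model is least sure about, which is the optimization-of-queries problem the challenge evaluates under various budgets.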