Multi-label active learning: key issues and a novel query strategy

Active learning is an iterative supervised learning task in which the learning algorithm can actively query an oracle, i.e., a human annotator who understands the nature of the problem, to obtain ground-truth labels. The motivation behind this approach is to let the learner interactively choose the data it will learn from, which can lead to significantly lower annotation cost, faster training and improved performance. Active learning is appropriate for machine learning applications where labeled data is costly to obtain but unlabeled data is abundant. Most importantly, unlike conventional supervised learning, it permits a learning model to evolve and adapt to new data. Although active learning has been widely studied for single-label learning, applications to multi-label learning have been more limited. In this work, we present a general framework for applying active learning to multi-label data, discussing the key issues that need to be considered in pool-based multi-label active learning and how existing solutions in the literature deal with each of them. We further propose a novel aggregation method for evaluating which instances should be annotated. Extensive experiments on 13 multi-label data sets with different characteristics, under two different application settings (transductive and inductive), show a consistent advantage of our proposed approach over the other approaches and, most importantly, over passive supervised learning, and reveal interesting aspects related mainly to the properties of the data sets and secondarily to the application settings.
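To make the pool-based setting concrete, the following Python sketch shows one querying loop: a multi-label classifier is trained on a small labeled seed set, per-label uncertainty scores are computed over the unlabeled pool, aggregated into a single score per instance, and the top-scoring instance is sent to the oracle for annotation. The synthetic data, the scikit-learn models and the average-uncertainty aggregation are illustrative assumptions for this sketch only; they are not the aggregation strategy proposed in this work.

    # Minimal sketch of pool-based multi-label active learning.
    # Assumptions: synthetic data, one-vs-rest logistic regression,
    # and a simple mean-uncertainty aggregation across labels.
    import numpy as np
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # Synthetic multi-label data standing in for a real annotated corpus.
    X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                          n_classes=5, random_state=0)

    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X), size=40, replace=False))  # seed set
    pool = [i for i in range(len(X)) if i not in labeled]       # unlabeled pool

    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

    for _ in range(10):  # annotation rounds (querying budget)
        model.fit(X[labeled], Y[labeled])

        # Per-label uncertainty: how close each label probability is to 0.5.
        proba = model.predict_proba(X[pool])          # (pool_size, n_labels)
        per_label_uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)

        # Aggregate label-wise scores into one score per instance (here: mean).
        instance_scores = per_label_uncertainty.mean(axis=1)

        # Query the "oracle" for the most uncertain instance; here the oracle
        # is simply the held-back ground truth Y.
        query = pool[int(np.argmax(instance_scores))]
        labeled.append(query)
        pool.remove(query)

    print(f"Labeled {len(labeled)} instances after 10 querying rounds.")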
