We introduce AL2, a pool-based active learning approach that learns how to inform the active-set selection. The framework is classifier-independent, amenable to different performance targets, and applicable to both binary and multinomial classification in batch-mode active learning. Here we consider a special instantiation, ALsubmodular, in which the choice of learning structure leads to a submodular objective function, thereby admitting an efficient greedy algorithm with an optimality guarantee of 1 − 1/e. Statistically significant improvements over the state of the art are demonstrated for two supervised learning methods, benchmark (UCI) datasets, and the motivating sustainability application of land-cover prediction in the Arctic.

1 Motivation and Related Work

Sustainability research is inherently a predictive science and can be crucially informed by accurate models of, e.g., species distributions, land use, and climate change [6, 8]. Consider a predictive model for land cover in the Arctic that relates ecological covariates to vegetation type. Such a model enables projections of the possible effects of climate scenarios by predicting the future composition of the land cover under drift of the ecological covariates [15]. The predictive accuracy and uncertainty estimates of the model are crucial and depend not only on the model complexity and inherent assumptions but also on the amount and quality of the training data. On one hand, ecological and environmental features such as biomass are readily available from remote sensing data sources. On the other hand, collecting information on the actual vegetation cover in different parts of the Arctic is an expensive and time-consuming task performed by surveys over areas of large spatial extent. Hence, land-cover survey planning has to be done carefully, in a targeted way, and with certain constraints in mind. This leads to experimental design and active learning (AL); for a comprehensive review, see Settles [18].

In pool-based active learning, one starts with a small training dataset L of labeled samples and a large pool U of unlabeled samples. On each iteration the active learner selects one or more samples from U, which are then labeled by an oracle (e.g., a human annotator) and added to the training dataset. The learner then retrains the predictive model and selects more samples for labeling. The goal of active learning is to achieve good performance of the predictive model with as few labeled samples as possible.

Most active learning research has focused on sequential active learning, in which one greedily selects the single most informative unlabeled sample from U according to some utility measure. The most commonly used utility measures fall within the family of uncertainty sampling methods, such as least-confident sampling [3], margin sampling [17], and entropy sampling [21]. Another family of sequential active learning approaches is based on the query-by-committee (QBC) algorithm [20], where selection is driven by the disagreement among committee classifiers about the label of an unlabeled sample. A key limitation of sequential active learning is the need to retrain after every single query, which can be time-consuming and in many applications is not even feasible due to limited resources and expertise.
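As a concrete illustration of these utility measures, the minimal sketch below computes the least-confident, margin, and entropy criteria from a matrix of predicted class probabilities. The function names and the scikit-learn-style predict_proba convention are assumptions made for exposition, not notation from the paper.

```python
import numpy as np

def least_confident(probs):
    """1 - max_y P(y|x): high when even the most likely label is uncertain."""
    return 1.0 - probs.max(axis=1)

def margin_utility(probs):
    """Negated gap between the two most probable labels (small gap = high utility)."""
    top2 = np.partition(probs, -2, axis=1)[:, -2:]  # two largest probabilities per row
    return -(top2[:, 1] - top2[:, 0])

def entropy_utility(probs):
    """Shannon entropy of the predictive distribution over the classes."""
    return -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
```

Here `probs` is an (n_samples, n_classes) array, e.g. the output of a classifier's predict_proba; all three utilities agree for binary problems but differ in the multinomial case.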
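The generic pool-based loop described above can likewise be sketched in a few lines. This is a hedged illustration of the standard sequential protocol with a pluggable utility measure, not the AL2 algorithm itself; the classifier interface, the oracle callback, and the budget parameter are all assumptions.

```python
import numpy as np

def sequential_active_learning(clf, X_lab, y_lab, X_pool, oracle, utility, budget):
    """Generic pool-based loop: query one sample per iteration, retraining each time.

    clf     : any classifier exposing fit / predict_proba (scikit-learn style)
    oracle  : callable mapping a pool index to its true label (e.g. a human annotator)
    utility : one of the uncertainty measures sketched above
    """
    X_lab, y_lab = list(X_lab), list(y_lab)
    pool = list(range(len(X_pool)))
    for _ in range(budget):
        clf.fit(np.asarray(X_lab), np.asarray(y_lab))
        probs = clf.predict_proba(np.asarray([X_pool[i] for i in pool]))
        query = pool[int(np.argmax(utility(probs)))]
        X_lab.append(X_pool[query])   # move the queried sample, together with
        y_lab.append(oracle(query))   # its oracle-provided label, from U into L
        pool.remove(query)
    clf.fit(np.asarray(X_lab), np.asarray(y_lab))
    return clf
```

The retrain-per-query structure of this loop is exactly the limitation noted above, and it is what batch-mode methods avoid by selecting several samples before retraining.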
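Finally, the 1 − 1/e guarantee mentioned in the abstract is the classical bound for greedy maximization of a monotone submodular set function under a cardinality constraint [21]. The sketch below shows that generic greedy procedure; the facility-location objective is only a stand-in for illustration, since the actual ALsubmodular objective is not reproduced in this excerpt.

```python
import numpy as np

def greedy_submodular(f, ground_set, k):
    """Greedy maximization of a monotone submodular f under |S| <= k.

    For such f, the greedy batch is within a factor 1 - 1/e of optimal [21].
    """
    S, remaining = [], set(ground_set)
    for _ in range(min(k, len(remaining))):
        # pick the element with the largest marginal gain f(S + e) - f(S)
        best = max(remaining, key=lambda e: f(S + [e]) - f(S))
        S.append(best)
        remaining.remove(best)
    return S

def facility_location(sim):
    """Stand-in objective (an assumption, not the paper's ALsubmodular objective):
    facility-location coverage over a pairwise similarity matrix, which is
    monotone and submodular, so the 1 - 1/e bound applies."""
    def f(S):
        return float(sim[:, S].max(axis=1).sum()) if S else 0.0
    return f
```

A batch of k queries over an n-sample pool would then be chosen as greedy_submodular(facility_location(sim), range(n), k), where sim is, e.g., an RBF kernel matrix over the pool.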
References

[1] Leo Breiman, et al. Bagging Predictors. Machine Learning, 1996.
[2] Thomas G. Dietterich. Machine Learning in Ecosystem Informatics and Sustainability. IJCAI, 2009.
[3] Mark Craven, et al. An Analysis of Active Learning Strategies for Sequence Labeling Tasks. EMNLP, 2008.
[4] Yi Zhang, et al. Incorporating Diversity and Density in Active Learning for Relevance Feedback. ECIR, 2007.
[5] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 2000.
[6] Richard G. Pearson, et al. Arctic greening under future climate change predicted using machine learning. 2011.
[7] Theodoros Damoulas, et al. Pattern Recognition. Encyclopedia of Information Systems, 1998.
[8] Russell Greiner, et al. Optimistic Active-Learning Using Mutual Information. IJCAI, 2007.
[9] Rong Jin, et al. Batch mode active learning and its application to medical image classification. ICML, 2006.
[10] Burr Settles, et al. Active Learning Literature Survey. 2009.
[11] Sang Joon Kim, et al. A Mathematical Theory of Communication. 2006.
[12] H. Sebastian Seung, et al. Query by committee. COLT '92, 1992.
[13] Andrew McCallum, et al. Reducing Labeling Effort for Structured Prediction Tasks. AAAI, 2005.
[14] Klaus Brinker, et al. Incorporating Diversity in Active Learning with Support Vector Machines. ICML, 2003.
[15] Stefan Wrobel, et al. Multi-class Ensemble-Based Active Learning. ECML, 2006.
[16] Yuhong Guo, et al. Active Instance Sampling via Matrix Partition. NIPS, 2010.
[17] Robert E. Schapire, et al. The Boosting Approach to Machine Learning: An Overview. 2003.
[18] Stefan Wrobel, et al. Active Hidden Markov Models for Information Extraction. IDA, 2001.
[19] C. Gomes. Computational Sustainability: Computational methods for a sustainable environment, economy, and society. 2009.
[20] Dale Schuurmans, et al. Discriminative Batch Mode Active Learning. NIPS, 2007.
[21] M. L. Fisher, et al. An analysis of approximations for maximizing submodular set functions—I. Math. Program., 1978.