Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm

Active learning algorithms select the training examples for which labels are requested from domain experts, and they are very effective at reducing human labeling effort in supervised learning. To reduce training time and to provide a more convenient user interaction environment, it is necessary to select batches of new training examples rather than a single example at a time. Batch mode active learning algorithms incorporate a diversity measure to construct a batch of diverse candidate examples. Existing approaches use greedy algorithms, which makes them feasible for data sets of thousands of examples. Greedy algorithms, however, are not efficient enough to scale to larger real-world classification applications that contain millions of examples. In this paper, we present an extremely efficient batch mode active learning algorithm. The new algorithm achieves the same results as the traditional greedy algorithm while reducing the run time by a factor of several hundred. We prove that the objective function of the algorithm is submodular, which guarantees that it finds the same solution as the greedy algorithm. We evaluate our approach on several large-scale real-world text classification problems and show that it achieves substantial speedups while obtaining the same classification accuracy.
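
To illustrate why submodularity can allow a faster procedure to return exactly the greedy solution, the following is a minimal sketch under stated assumptions, not the algorithm of this paper. The abstract does not specify the objective or the acceleration technique, so the sketch substitutes a facility-location diversity objective (a standard submodular function) and lazy evaluation of marginal gains (the accelerated greedy scheme usually attributed to Minoux). The names `marginal_gain`, `add_to_batch`, `greedy`, and `lazy_greedy`, as well as the random similarity matrix, are hypothetical and chosen only for illustration.

```python
import heapq
import random


def marginal_gain(sim, cover, j):
    """Increase in the facility-location value if candidate j joins the batch."""
    return sum(max(row[j] - c, 0.0) for row, c in zip(sim, cover))


def add_to_batch(sim, cover, j):
    """Update each example's best similarity to the batch after adding j."""
    return [max(c, row[j]) for row, c in zip(sim, cover)]


def greedy(sim, k):
    """Plain greedy: re-evaluate every remaining candidate at every step."""
    n = len(sim)
    cover, batch, evals = [0.0] * n, [], 0
    for _ in range(k):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in batch:
                continue
            gain = marginal_gain(sim, cover, j)
            evals += 1
            if gain > best_gain:  # strict: ties go to the lowest index
                best_j, best_gain = j, gain
        batch.append(best_j)
        cover = add_to_batch(sim, cover, best_j)
    return batch, evals


def lazy_greedy(sim, k):
    """Lazy greedy: old gains are upper bounds because gains only shrink."""
    n = len(sim)
    cover, batch, evals = [0.0] * n, [], n
    # Heap entries are (-gain, index, step at which the gain was computed).
    heap = [(-marginal_gain(sim, cover, j), j, 0) for j in range(n)]
    heapq.heapify(heap)
    for step in range(k):
        while True:
            _, j, stamp = heapq.heappop(heap)
            if stamp == step:  # gain is up to date and still on top: select
                break
            # Stale bound: recompute against the current batch and re-insert.
            heapq.heappush(heap, (-marginal_gain(sim, cover, j), j, step))
            evals += 1
        batch.append(j)
        cover = add_to_batch(sim, cover, j)
    return batch, evals


if __name__ == "__main__":
    random.seed(0)
    n = 200  # assumed pool size, for illustration only
    sim = [[random.random() for _ in range(n)] for _ in range(n)]
    g_batch, g_evals = greedy(sim, 10)
    l_batch, l_evals = lazy_greedy(sim, 10)
    print("same batch:", g_batch == l_batch)
    print("gain evaluations (greedy, lazy):", g_evals, l_evals)
```

Both functions break ties in favor of the lowest index, so they select the same batch. The lazy variant never performs more gain evaluations than plain greedy, and it typically performs far fewer. How much is saved depends on the data, and this sketch makes no claim about the speedups reported in this paper.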