Selection Via Proxy: Efficient Data Selection For Deep Learning

Data selection methods such as active learning and core-set selection are useful tools for machine learning on large datasets, but they can be prohibitively expensive to apply in deep learning. Unlike in other areas of machine learning, the feature representations that these techniques depend on must be learned in deep learning rather than given, which requires a substantial amount of training time. In this work, we show that we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning). In deep learning, we can scale down models by removing hidden layers or reducing their dimension to create proxies that are an order of magnitude faster. Although these small proxy models have significantly higher error, we find empirically that they provide useful rankings for data selection that correlate highly with those of larger models. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks. For active learning, applying SVP to the recent deep active learning method of Sener and Savarese [2018] gives a 4x improvement in execution time while yielding the same model accuracy. For core-set selection, we show that a proxy model that trains 10x faster than a target ResNet164 model on CIFAR10 can be used to remove 50% of the training data without compromising the accuracy of the target model, making end-to-end training time improvements via core-set selection possible.
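
To make the idea concrete, the following is a minimal sketch of SVP for uncertainty-based active learning, assuming PyTorch. The function name `select_via_proxy`, the `unlabeled_loader` yielding (index, input) batches, and the labeling `budget` are illustrative placeholders rather than the paper's code; the paper also pairs proxies with the greedy k-center (core-set) criterion of Sener and Savarese [2018], which is omitted here.

```python
# Minimal sketch: rank unlabeled examples with a cheap proxy model and
# return the most uncertain ones to label for the large target model.
# Assumes PyTorch; `unlabeled_loader` is presumed to yield (index, input) batches.
import torch
import torch.nn.functional as F

def select_via_proxy(proxy_model, unlabeled_loader, budget, device="cpu"):
    """Score each unlabeled example by the proxy's predictive entropy and
    return the indices of the `budget` highest-entropy (most uncertain) points."""
    proxy_model.eval()
    all_scores, all_indices = [], []
    with torch.no_grad():
        for batch_indices, inputs in unlabeled_loader:
            logits = proxy_model(inputs.to(device))
            probs = F.softmax(logits, dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            all_scores.append(entropy.cpu())
            all_indices.append(torch.as_tensor(batch_indices))
    scores = torch.cat(all_scores)
    indices = torch.cat(all_indices)
    top = scores.argsort(descending=True)[:budget]
    return indices[top].tolist()
```

The selected points would then be labeled and used to train the large target model (e.g., ResNet164); because only the small proxy is trained and queried during selection, the selection step itself is roughly an order of magnitude cheaper than running the same criterion with the target model.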

[1] Baharan Mirzasoleiman et al. Select Via Proxy: Efficient Data Selection For Training Deep Networks, 2018.

[2] Zoubin Ghahramani et al. Deep Bayesian Active Learning with Image Data, 2017, ICML.

[3] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Byron C. Wallace et al. How transferable are the datasets collected by active learners?, 2018, ArXiv.

[5] Peter Norvig et al. The Unreasonable Effectiveness of Data, 2009, IEEE Intelligent Systems.

[6] Trevor Campbell et al. Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent, 2018, ICML.

[7] Trevor Campbell et al. Coresets for Scalable Bayesian Logistic Regression, 2016, NIPS.

[8] Tomas Mikolov et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[9] Silvio Savarese et al. Active Learning for Convolutional Neural Networks: A Core-Set Approach, 2017, ICLR.

[10] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[11] Xiang Zhang et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.

[12] Zoubin Ghahramani et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[13] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[14] Nikos Komodakis et al. Wide Residual Networks, 2016, BMVC.

[15] Dumitru Erhan et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] H. Sebastian Seung et al. Query by committee, 1992, COLT '92.

[17] Rishabh K. Iyer et al. Learning Mixtures of Submodular Functions for Image Collection Summarization, 2014, NIPS.

[18] Franziska Abend et al. Facility Location: Concepts, Models, Algorithms and Case Studies, 2016.

[19] Michael Carbin et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[20] Michael S. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[21] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Luca Antiga et al. Automatic differentiation in PyTorch, 2017.

[23] Udo Hahn et al. An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data, 2007, EMNLP.

[24] Francisco Casacuberta et al. Active Learning for Interactive Neural Machine Translation of Data Streams, 2018, CoNLL.

[25] Burr Settles et al. From Theories to Queries: Active Learning in Practice, 2011.

[26] Kilian Q. Weinberger et al. Deep Networks with Stochastic Depth, 2016, ECCV.

[27] Yann LeCun et al. Very Deep Convolutional Networks for Text Classification, 2016, EACL.

[28] Ruimao Zhang et al. Cost-Effective Active Learning for Deep Image Classification, 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[29] Trevor Campbell et al. Automated Scalable Bayesian Inference via Hilbert Coresets, 2017, J. Mach. Learn. Res.

[30] William A. Gale et al. A sequential algorithm for training text classifiers, 1994, SIGIR '94.

[31] Jeff A. Bilmes et al. Submodular subset selection for large-scale speech training data, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32] Ivor W. Tsang et al. Core Vector Machines: Fast SVM Training on Very Large Data Sets, 2005, J. Mach. Learn. Res.

[33] Yoshua Bengio et al. An Empirical Study of Example Forgetting during Deep Neural Network Learning, 2018, ICLR.

[34] Jeff A. Bilmes et al. Using Document Summarization Techniques for Speech Data Subset Selection, 2013, NAACL.

[35] Yonghui Wu et al. Exploring the Limits of Language Modeling, 2016, ArXiv.

[36] Kaiming He et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.

[37] Adam Gaier et al. Weight Agnostic Neural Networks, 2019, NeurIPS.

[38] Forrest N. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size, 2016, ArXiv.

[39] Jian Sun et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.

[40] Tat-Seng Chua et al. Neural Collaborative Filtering, 2017, WWW.

[41] Chen Sun et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42] David Cohn et al. Active Learning, 2010, Encyclopedia of Machine Learning.

[43] Anima Anandkumar et al. Deep Active Learning for Named Entity Recognition, 2017, Rep4NLP@ACL.

[44] Yann Dauphin et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[45] Xiang Zhang et al. Text Understanding from Scratch, 2015, ArXiv.

[46] Sariel Har-Peled et al. Smaller Coresets for k-Median and k-Means Clustering, 2005, SCG.

[47] Percy Liang et al. On the Relationship between Data Efficiency and Error for Uncertainty Sampling, 2018, ICML.

[48] Bin Ma et al. Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49] Lukasz Kaiser et al. Attention Is All You Need, 2017, NIPS.

[50] Reza Zanjirani Farahani et al. Facility location: concepts, models, algorithms and case studies, 2009.

[51] David D. Lewis et al. Heterogeneous Uncertainty Sampling for Supervised Learning, 1994, ICML.

[52] Hao Wu et al. Mixed Precision Training, 2017, ICLR.

[53] Jieping Ye et al. Querying discriminative and representative samples for batch mode active learning, 2013, KDD.

[54] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[56] Yang Yang et al. Deep Learning Scaling is Predictable, Empirically, 2017, ArXiv.

[57] Yarin Gal et al. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning, 2019, NeurIPS.

[58] Zoubin Ghahramani et al. Bayesian Active Learning for Classification and Preference Learning, 2011, ArXiv.