Fixing Mini-batch Sequences with Hierarchical Robust Partitioning

We propose a general and efficient hierarchical robust partitioning framework that generates a deterministic sequence of mini-batches with quality assurances that a randomly drawn sequence lacks. Comparing our deterministically generated mini-batch sequences to randomly generated ones on a variety of deep learning tasks, we show that the deterministic sequences significantly beat the mean and worst-case performance of the random sequences, and often outperform even the best of the random sequences. Our theoretical contributions include a new algorithm for the robust submodular partition problem subject to cardinality constraints, which we use to construct the mini-batch sequences; we show that the algorithm is fast and carries strong approximation guarantees, and we give a more efficient hierarchical variant with similar guarantees under mild assumptions.
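To make the robust partitioning idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: a greedy heuristic for a cardinality-constrained max-min (robust) submodular partition, using a facility-location objective over a pairwise similarity matrix. At each step, the currently worst-off block receives the remaining item with the largest marginal gain, so every block (mini-batch) ends up representative of the whole dataset. The function name `robust_greedy_partition` and all parameters are hypothetical.

```python
import numpy as np

def robust_greedy_partition(similarity, num_blocks, block_size):
    """Greedy max-min partition: repeatedly give the currently worst-off
    block its best remaining item under a facility-location objective
    f(S) = sum_v max_{s in S} similarity[s, v]."""
    n = similarity.shape[0]
    blocks = [[] for _ in range(num_blocks)]
    # coverage[j][v] = max similarity between item v and block j's members
    coverage = np.zeros((num_blocks, n))
    values = np.zeros(num_blocks)  # facility-location value of each block
    remaining = set(range(n))
    while remaining:
        # consider only blocks that have not reached the cardinality bound
        open_blocks = [j for j in range(num_blocks)
                       if len(blocks[j]) < block_size]
        if not open_blocks:
            break
        # robust objective: improve the block with the smallest value
        j = min(open_blocks, key=lambda b: values[b])
        # greedily pick the item with the largest marginal gain for block j
        best_item, best_gain = None, -1.0
        for v in remaining:
            gain = np.maximum(coverage[j], similarity[v]).sum() - values[j]
            if gain > best_gain:
                best_item, best_gain = v, gain
        blocks[j].append(best_item)
        coverage[j] = np.maximum(coverage[j], similarity[best_item])
        values[j] = coverage[j].sum()
        remaining.remove(best_item)
    return blocks
```

Iterating over the resulting blocks in a fixed order then yields a deterministic mini-batch sequence; the hierarchical variant in the paper reduces the cost of this kind of greedy assignment on large datasets.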
