Data Summarization at Scale: A Two-Stage Submodular Approach

The sheer scale of modern datasets has created a pressing need for summarization techniques that identify representative elements in a dataset. Fortunately, the vast majority of data summarization tasks satisfy an intuitive diminishing-returns condition known as submodularity, which allows us to find near-optimal solutions in linear time. We focus on a two-stage submodular framework in which the goal is to use a given set of training functions to reduce the ground set, so that optimizing new functions (drawn from the same distribution) over the reduced set provides almost as much value as optimizing them over the entire ground set. In this paper, we develop the first streaming and distributed solutions to this problem. In addition to providing strong theoretical guarantees, we demonstrate both the utility and the efficiency of our algorithms on real-world tasks, including image summarization and ride-share optimization.
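
To make the two-stage objective concrete, the following is a minimal sketch and not the streaming or distributed algorithms developed in the paper. It assumes toy coverage functions as the training functions and uses the classical greedy heuristic for the inner maximization; the names `coverage_function`, `greedy`, and `two_stage_value` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set

SetFn = Callable[[Sequence[int]], int]


def coverage_function(sets: Dict[int, Set[str]]) -> SetFn:
    """Build f(T) = number of items covered by the elements of T.

    Coverage functions are a standard example of monotone submodular functions.
    """
    def f(T: Sequence[int]) -> int:
        covered: Set[str] = set()
        for e in T:
            covered |= sets[e]
        return len(covered)
    return f


def greedy(f: SetFn, candidates: Sequence[int], k: int) -> List[int]:
    """Pick up to k candidates, each time adding the largest marginal gain."""
    chosen: List[int] = []
    remaining = set(candidates)
    for _ in range(k):
        base = f(chosen)
        best, best_gain = None, 0
        for e in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = f(chosen + [e]) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no element adds value
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen


def two_stage_value(train_fns: Sequence[SetFn], S: Sequence[int], k: int) -> float:
    """Average, over the training functions, of the greedy size-k value within S."""
    return sum(f(greedy(f, S, k)) for f in train_fns) / len(train_fns)


# Toy data (assumed for illustration): ground set {0, 1, 2, 3}, two training functions.
f1 = coverage_function({0: {"a", "b"}, 1: {"b"}, 2: {"c"}, 3: set()})
f2 = coverage_function({0: {"x"}, 1: {"x", "y"}, 2: {"y"}, 3: set()})
fns = [f1, f2]

full_value = two_stage_value(fns, [0, 1, 2, 3], k=1)
good_reduced = two_stage_value(fns, [0, 1], k=1)
poor_reduced = two_stage_value(fns, [2, 3], k=1)
```

In this toy instance the reduced set `[0, 1]` retains the full-ground-set value, while `[2, 3]` does not, which is the kind of gap a good first-stage reduction is meant to avoid.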
