Streaming submodular maximization: massive data summarization on the fly

How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. That is, we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 − ε) approximation guarantee relative to the optimum solution, requiring only a single pass through the data and memory independent of the data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
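The abstract does not spell out how the single-pass algorithm works, so the following is only an illustrative sketch of one thresholding idea for monotone submodular maximization under a cardinality constraint. It assumes the largest single-element value is known before the pass begins. The coverage objective, the function names (`coverage`, `threshold_stream`), and the parameter `max_singleton` are hypothetical choices made for this example and are not taken from the paper.

```python
import math
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple


def coverage(sets: Dict[Hashable, Set[int]]) -> Callable[[Iterable[Hashable]], float]:
    """Build f(S) = number of items covered by the chosen sets.

    Coverage is a standard example of a monotone submodular function:
    adding a set to a larger selection never covers more new items than
    adding it to a smaller one (diminishing returns).
    """
    def f(selection: Iterable[Hashable]) -> float:
        covered: Set[int] = set()
        for key in selection:
            covered |= sets[key]
        return float(len(covered))
    return f


def threshold_stream(
    stream: Iterable[Hashable],
    f: Callable[[Iterable[Hashable]], float],
    k: int,
    eps: float,
    max_singleton: float,
) -> Tuple[List[Hashable], float]:
    """Single-pass thresholding sketch (assumption: max_singleton is known).

    Keeps one candidate summary per guess v of the optimum value. Memory
    depends on k and eps only, not on the length of the stream.
    """
    # Geometric grid of guesses covering [max_singleton, 2 * k * max_singleton],
    # widened by one step on each side to be safe against rounding.
    step = math.log(1 + eps)
    lo = math.floor(math.log(max_singleton) / step)
    hi = math.ceil(math.log(2 * k * max_singleton) / step)
    guesses = [(1 + eps) ** i for i in range(lo, hi + 1)]

    candidates: Dict[float, List[Hashable]] = {v: [] for v in guesses}
    values: Dict[float, float] = {v: 0.0 for v in guesses}

    for element in stream:  # each element is seen exactly once
        for v in guesses:
            summary = candidates[v]
            if len(summary) >= k:
                continue  # this candidate summary is already full
            new_value = f(summary + [element])
            gain = new_value - values[v]
            # Accept the element if its marginal gain is large enough to
            # stay on track for reaching v / 2 within the remaining budget.
            if gain >= (v / 2 - values[v]) / (k - len(summary)):
                summary.append(element)
                values[v] = new_value

    best = max(guesses, key=lambda v: values[v])
    return candidates[best], values[best]


# Tiny usage example on hypothetical data.
demo_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4}}
demo_f = coverage(demo_sets)
demo_summary, demo_value = threshold_stream(
    list(demo_sets), demo_f, k=2, eps=0.1, max_singleton=4.0
)
```

The design choice worth noting is that no element is ever revisited or evicted: each guess either accepts an arriving element immediately or discards it for good, which is what allows a single pass with memory that depends only on k and ε.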
