Very Fast Streaming Submodular Function Maximization

Data summarization has become a valuable tool for understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been the focus of summarization algorithms. These algorithms offer worst-case approximation guarantees at the expense of higher computation and memory requirements. However, many practical applications do not fall under this worst case and are usually much better behaved. In this paper, we propose a new submodular function maximization algorithm called ThreeSieves, which ignores the worst case but delivers a good solution with high probability. It selects the most informative items from a data stream on the fly and maintains a provable performance on a fixed memory budget. In an extensive evaluation, we compare our method against $6$ other methods on $8$ different datasets with and without concept drift. We show that our algorithm outperforms current state-of-the-art algorithms and, at the same time, uses fewer resources. Finally, we highlight a real-world use case of our algorithm for data summarization in gamma-ray astronomy. We make our code publicly available at this https URL.
