Learning-Based Frequency Estimation Algorithms

Estimating the frequencies of elements in a data stream is a fundamental task in data analysis and machine learning. The problem is typically addressed using streaming algorithms, which can process very large data sets using limited storage. Today’s streaming algorithms, however, cannot exploit patterns in their input to improve performance. We propose a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve their frequency estimates. The proposed algorithms combine the benefits of machine learning with the formal guarantees available through algorithm theory. We prove that our learning-based algorithms have lower estimation errors than their non-learning counterparts. We also evaluate our algorithms on two real-world datasets and empirically demonstrate their performance gains.
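
To make the idea concrete, the following is a minimal sketch of one way a learned component could be combined with a standard sketch. It rests on several assumptions that the abstract does not state: the non-learning baseline is taken to be a Count-Min sketch, the learned model is reduced to a hypothetical predicate `is_heavy` that flags items expected to be frequent, and flagged items are counted exactly in an unbounded dictionary (a real design would cap that space). The class names `CountMinSketch` and `LearnedSketch` are illustrative only and are not the authors' implementation.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Plain Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum over rows
        # is an upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


class LearnedSketch:
    """Assumed design: predicted-heavy items bypass the sketch entirely."""

    def __init__(self, is_heavy, width, depth):
        self.is_heavy = is_heavy      # hypothetical learned predicate
        self.exact = Counter()        # exact counts for predicted-heavy items
        self.sketch = CountMinSketch(width, depth)

    def add(self, item, count=1):
        if self.is_heavy(item):
            self.exact[item] += count
        else:
            self.sketch.add(item, count)

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact[item]
        return self.sketch.estimate(item)


# Toy stream: two frequent items and 200 items that each appear once.
stream = ["a"] * 1000 + ["b"] * 500 + [f"x{i}" for i in range(200)]
learned = LearnedSketch(lambda item: item in {"a", "b"}, width=64, depth=3)
for element in stream:
    learned.add(element)
```

Under these assumptions, items flagged by the predicate are counted exactly, and the remaining items no longer collide with them inside the sketch, which is the intuition behind expecting lower estimation error than the plain sketch.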
