Fast Rotation Kernel Density Estimation over Data Streams

Kernel density estimation is a powerful tool that is widely used in important real-world applications such as anomaly detection and statistical learning. Unfortunately, current kernel methods incur high computational or space costs on large-scale, high-dimensional datasets, especially when the data arrive as a stream. Although sketch methods have been designed for kernel density estimation over data streams, they remain computationally expensive. To address this problem, we propose a novel Rotation Kernel, which is based on a Rotation Hash method and is much faster to compute. To achieve memory-efficient kernel density estimation over data streams, we design RKD-Sketch, a method that compresses high-dimensional data streams into a small array of integer counters. Extensive experiments on both synthetic and real-world datasets demonstrate that RKD-Sketch reduces computational cost by up to 216 times and space by up to 104 times compared with state-of-the-art methods. Furthermore, we apply the Rotation Kernel to active learning, where our method achieves up to a 256-fold speedup and uses up to 13 times less space while reaching the same accuracy as the baseline methods.
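To make the idea of compressing a stream into a small array of integer counters concrete, the following is a minimal sketch of a generic hash-and-count density estimator. It is not the RKD-Sketch or the Rotation Hash described above, whose details are not given here: the hash is swapped for plain sign random projections, and the class name `LSHCounterSketch` and its parameters `rows`, `bits`, and `seed` are hypothetical names chosen for illustration only.

```python
import random


class LSHCounterSketch:
    """Illustrative counter-array sketch (assumed design, not the paper's method).

    Each of `rows` rows owns 2**bits integer counters. A stream item is hashed
    once per row with sign random projections and the matching counter is
    incremented. A query averages the counters it hashes to, which estimates
    the mean hash-collision probability between the query and the stream.
    """

    def __init__(self, dim, rows=50, bits=4, seed=0):
        rng = random.Random(seed)
        self.rows, self.bits = rows, bits
        # `bits` random Gaussian hyperplanes per row (stand-in for a faster hash).
        self.planes = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(bits)] for _ in range(rows)]
        self.counters = [[0] * (1 << bits) for _ in range(rows)]
        self.n = 0  # number of stream items seen so far

    def _bucket(self, row, x):
        # Concatenate the sign bits of the projections into a bucket index.
        idx = 0
        for plane in self.planes[row]:
            dot = sum(p * v for p, v in zip(plane, x))
            idx = (idx << 1) | (1 if dot >= 0 else 0)
        return idx

    def add(self, x):
        # One pass over the stream: only counters are updated, x is not stored.
        for r in range(self.rows):
            self.counters[r][self._bucket(r, x)] += 1
        self.n += 1

    def query(self, q):
        # Average collision count, normalised by the stream length.
        if self.n == 0:
            return 0.0
        total = sum(self.counters[r][self._bucket(r, q)]
                    for r in range(self.rows))
        return total / (self.rows * self.n)
```

Memory in this sketch is `rows * 2**bits` integers regardless of stream length, and the dominant cost per item is computing the hash. That hash cost is the part a faster hash function would reduce.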
