Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary

With the growth of the mobile Internet, large volumes of stream data have emerged. Stream data cannot be stored completely in memory because of its massive volume and continuous arrival. Moreover, it should be accessed only once and processed promptly, because multiple accesses are costly. The intrinsic nature of stream data therefore calls for a summary maintained in main memory that enables fast incremental learning and works within limited time and memory. Sampling is one of the commonly used methods for constructing data stream summaries. Because traditional random sampling can deviate from the real data distribution and does not consider the true distribution of the stream data attributes, we propose a novel feature-selected and -preserved sampling algorithm. We first use matrix approximation to select important features in the stream data. Then, a feature-preserved sampling algorithm generates high-quality representative samples over a sliding window. The algorithm aims to guarantee a high degree of consistency between the distribution of attribute values in the population (the entire data) and that in the sample. Experiments on real datasets show that the proposed algorithm can select a representative sample with high efficiency.
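To make the notion of sample-versus-population consistency concrete, the following minimal Python sketch draws a plain reservoir sample (the traditional random-sampling baseline, not the proposed algorithm) from a one-pass stream and measures, per attribute, how far the sample's value distribution is from the population's. The record layout, the attribute names `device` and `os`, and the use of total variation distance as the consistency measure are illustrative assumptions and are not taken from the paper.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Uniform random sample of k items from a stream read once (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def attribute_distribution(records, attr):
    """Relative frequency of each value of one attribute."""
    counts = Counter(r[attr] for r in records)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


# Hypothetical stream of records with two categorical attributes (assumed data).
rng = random.Random(0)
population = [
    {"device": rng.choice(["phone", "tablet", "watch"]), "os": rng.choice(["a", "b"])}
    for _ in range(10_000)
]
sample = reservoir_sample(iter(population), 200, rng)

gaps = {}
for attr in ("device", "os"):
    gaps[attr] = total_variation(
        attribute_distribution(population, attr),
        attribute_distribution(sample, attr),
    )
    print(attr, round(gaps[attr], 3))
```

A smaller distance means the sample reproduces that attribute's distribution more faithfully. A feature-preserved sampler of the kind described above would aim to keep this gap small on the selected features, whereas a uniform reservoir sample leaves it to chance.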
