Sampling algorithms in data stream environments

Data streams are large data sets generated continuously and at a high rate. Their arrival rate exceeds the available processing and storage capacity, so a stream cannot be stored in its entirety and must be processed in a single pass. However, for a given stream it is not always possible to predict in advance all of the processing that will be required, so part of the data must be retained for future use. The retained data form “summaries”. There are several ways to construct a summary, and sampling algorithms are one of them. In this paper we present an in-depth study of the sampling methods used to construct data stream summaries. The paper has two main parts. First, we introduce the basic concepts of data streams: windowing models over data streams and data stream applications. Second, we describe the different sampling algorithms used in stream environments, focusing on their advantages and drawbacks. Finally, we compare the performance of simple random sampling with that of the chain sampling algorithm and discuss the relevant research challenges for data stream sampling.
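
To make the idea of a single-pass, bounded-memory sample concrete, here is a minimal sketch of classic reservoir sampling in Python. It is a standard one-pass technique and is shown only as an illustration; it is not necessarily the variant evaluated in the paper, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    The stream is read once and only k items are held in memory.
    (Hypothetical helper, written for illustration.)
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 100)` returns 100 values drawn from the range without ever materializing the whole stream. This basic form samples over the entire stream seen so far; sampling over a sliding window, as chain sampling does, needs additional bookkeeping that is not shown here.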
