Composed sketch framework for quantiles and cardinality queries over big data streams

Quantile and cardinality queries are important tools for analyzing statistical information from big data streams. Because such streams are characterized by huge volume and high velocity, it is challenging to answer these two types of queries quickly using constrained space. In this paper, we propose a composed sketch framework that supports both quantile queries and cardinality queries over data streams. We introduce cardinality estimators into a baseline q-digest structure and propose unified sketch merging and query processing operations, so that our approach can support the two types of queries simultaneously. We conduct detailed theoretical and experimental analysis in terms of query accuracy and query response time. The analytical and experimental results show that our approach obtains accurate estimates faster than traditional methods and systems in big data stream environments, and that it incurs less than 0.8‰ storage overhead on TB-scale real-world data sets.
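The idea of attaching cardinality estimators to a quantile summary can be illustrated with a minimal sketch. This is not the paper's data structure: it assumes equal-width value buckets as a simplified stand-in for the q-digest tree, and a linear-counting bitmap as the per-bucket cardinality estimator. The class name `ComposedSketch`, its parameters, and the `(value, key)` record shape are hypothetical and chosen only for illustration.

```python
import hashlib
import math


class ComposedSketch:
    """Toy composed sketch (illustrative assumption, not the paper's design).

    Each equal-width value bucket keeps a count (used for quantiles) and a
    linear-counting bitmap of the keys seen in it (used for cardinality).
    """

    def __init__(self, universe=1024, buckets=16, bits=4096):
        self.width = universe // buckets   # values must lie in [0, universe)
        self.bits = bits                   # bitmap size per bucket
        self.counts = [0] * buckets
        self.bitmaps = [0] * buckets       # Python ints used as bit sets

    def _hash(self, key):
        digest = hashlib.sha1(str(key).encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.bits

    def add(self, value, key):
        b = value // self.width
        self.counts[b] += 1
        self.bitmaps[b] |= 1 << self._hash(key)

    def merge(self, other):
        # Unified merge: counts add up, bitmaps are OR-ed together.
        for b in range(len(self.counts)):
            self.counts[b] += other.counts[b]
            self.bitmaps[b] |= other.bitmaps[b]

    def quantile(self, q):
        # Returns the upper edge of the bucket holding the q-quantile,
        # so the error is at most one bucket width.
        target = q * sum(self.counts)
        running = 0
        for b, c in enumerate(self.counts):
            running += c
            if c > 0 and running >= target:
                return (b + 1) * self.width - 1
        return len(self.counts) * self.width - 1

    def cardinality(self, lo, hi):
        # Estimated number of distinct keys in the buckets covering [lo, hi].
        union = 0
        for b in range(lo // self.width, hi // self.width + 1):
            union |= self.bitmaps[b]
        zeros = max(self.bits - bin(union).count("1"), 1)
        # Linear counting estimate: -m * ln(fraction of zero bits).
        return -self.bits * math.log(zeros / self.bits)


# Two partial sketches, e.g. built on different nodes, then merged.
a, b = ComposedSketch(), ComposedSketch()
for i in range(1000):
    a.add(i, f"user{i % 100}")        # keys user0 .. user99
    b.add(i, f"user{50 + i % 100}")   # keys user50 .. user149
a.merge(b)
print("approximate median:", a.quantile(0.5))
print("approximate distinct keys:", round(a.cardinality(0, 1023)))
```

Because both the counts and the bitmaps merge by simple element-wise operations, a single merged structure can answer either query type. A real q-digest would additionally compress sparse tree nodes to bound space, which this sketch omits.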