Synopsis: A Distributed Sketch over Voluminous Spatiotemporal Observational Streams

Networked observational devices have proliferated in recent years, producing voluminous data streams from a variety of sources and problem domains. These streams often have a spatiotemporal component and include multidimensional features of interest. Processing such data offline with batch systems or data warehouses is costly in both storage and computation, and in many situations the insights derived from the streams are useful only if they are timely. In this study, we propose Synopsis, an online, distributed sketch constructed from voluminous spatiotemporal data streams. The sketch summarizes feature values and inter-feature relationships in memory to support real-time query evaluation and to serve as input to computations expressed using analytical engines. As the data streams evolve, Synopsis performs targeted dynamic scaling to maintain high accuracy and effective resource utilization. We evaluate the system on two real-world spatiotemporal datasets and demonstrate its efficacy in terms of both scalability and query evaluation.
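To make the idea of summarizing feature values and inter-feature relationships in memory more concrete, the following is a minimal sketch, not the Synopsis implementation. It assumes a Welford-style online update for the mean, variance, and covariance of two features, plus a merge step so that summaries built on different nodes can be combined. The class name `PairSummary`, its fields, and the example values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PairSummary:
    """Hypothetical running summary of two features (x, y) from a stream."""
    n: int = 0            # observations seen so far
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0     # sum of squared deviations of x
    m2_y: float = 0.0     # sum of squared deviations of y
    c_xy: float = 0.0     # sum of co-deviations of x and y

    def update(self, x: float, y: float) -> None:
        """Fold one observation into the summary in O(1) time and space."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def merge(self, other: "PairSummary") -> "PairSummary":
        """Combine two summaries built independently (e.g. on two nodes)."""
        n = self.n + other.n
        if n == 0:
            return PairSummary()
        dx = other.mean_x - self.mean_x
        dy = other.mean_y - self.mean_y
        w = self.n * other.n / n
        return PairSummary(
            n=n,
            mean_x=self.mean_x + dx * other.n / n,
            mean_y=self.mean_y + dy * other.n / n,
            m2_x=self.m2_x + other.m2_x + dx * dx * w,
            m2_y=self.m2_y + other.m2_y + dy * dy * w,
            c_xy=self.c_xy + other.c_xy + dx * dy * w,
        )

    def variance_x(self) -> float:
        """Sample variance of x; 0.0 until at least two observations exist."""
        return self.m2_x / (self.n - 1) if self.n > 1 else 0.0

    def correlation(self) -> float:
        """Pearson correlation of x and y; 0.0 if either feature is constant."""
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Illustrative values only: two partial streams summarized separately.
    node_a, node_b = PairSummary(), PairSummary()
    for x, y in [(1.0, 2.0), (2.0, 4.0)]:
        node_a.update(x, y)
    for x, y in [(3.0, 6.0), (4.0, 8.0)]:
        node_b.update(x, y)
    combined = node_a.merge(node_b)
    print(combined.n, combined.mean_x, combined.correlation())
```

Because each summary has a fixed size regardless of how many observations it has absorbed, and two summaries can be merged without revisiting the raw data, a structure of this kind fits an online, distributed setting. How Synopsis actually organizes and scales its summaries is not specified in the abstract.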
