Incremental Sliding Window Analytics

Pramod Bhatotia, University of Edinburgh, e-mail: pramod.bhatotia@ed.ac.uk
Umut A. Acar, CMU, e-mail: umut@cs.cmu.edu
Flavio P. Junqueira, Dell EMC, e-mail: fpj@apache.org
Rodrigo Rodrigues, IST Lisbon, e-mail: rodrigo.miragaia.rodrigues@tecnico.ulisboa.pt

Sliding-window computations are widely used for large-scale data analysis, particularly in live systems where new data arrives continuously. These computations consume significant computational resources because they typically recompute the result over the entire window every time the window slides. In this chapter, we propose techniques for improving the scalability of sliding-window computations by performing them incrementally. In our approach, when new data is added at the end of the window or old data is dropped from its beginning, the output is updated automatically and efficiently by reusing previously run sub-computations. The key idea is to organize the sub-computations as a shallow (logarithmic-depth) balanced tree and to perform incremental updates by propagating changes through this tree. This approach is motivated and inspired by advances in self-adjusting computation, which enables automatic and efficient incremental computation. We present a Hadoop-based implementation that also provides a dataflow query processing interface. We evaluate it with a variety of applications and real-world case studies. Our results show significant performance improvements for large-scale sliding-window computations without any modifications to the existing data analysis code.
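To make the tree-based idea concrete, the following is a minimal Python sketch and not the Hadoop-based implementation described in this chapter. It assumes a fixed-width window that slides by one bucket at a time, and a combining function that is associative and commutative (as a sum or a per-key count would be). The class name `SlidingTree`, its `slide` method, and the `recomputed` counter are hypothetical names introduced only for this illustration. Each leaf holds the partial result of one bucket; when the window slides, the newest bucket overwrites the oldest leaf and only the nodes on the path from that leaf to the root are recombined.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class SlidingTree(Generic[T]):
    """Fixed-width window of partial results kept in a balanced binary tree.

    Illustrative sketch only: `combine` is assumed to be associative and
    commutative, because leaves are reused in circular order.
    """

    def __init__(self, capacity: int, combine: Callable[[T, T], T]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        size = 1
        while size < capacity:  # round the leaf count up to a power of two
            size *= 2
        self.capacity = capacity
        self.size = size
        self.combine = combine
        # Array-backed tree: node i has children 2*i and 2*i + 1,
        # and the leaves occupy indices size .. 2*size - 1.
        self.nodes: List[Optional[T]] = [None] * (2 * size)
        self.next_slot = 0   # leaf that holds the oldest bucket once full
        self.recomputed = 0  # internal nodes recombined by the last slide

    def _merge(self, left: Optional[T], right: Optional[T]) -> Optional[T]:
        # Empty subtrees (None) contribute nothing to the result.
        if left is None:
            return right
        if right is None:
            return left
        return self.combine(left, right)

    def slide(self, bucket: T) -> Optional[T]:
        """Add the newest bucket, dropping the oldest one once the window
        is full, and return the result for the whole window."""
        i = self.size + self.next_slot
        self.nodes[i] = bucket
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.recomputed = 0
        i //= 2
        while i >= 1:  # propagate the change along one leaf-to-root path
            self.nodes[i] = self._merge(self.nodes[2 * i], self.nodes[2 * i + 1])
            self.recomputed += 1
            i //= 2
        return self.nodes[1]


# Usage: a window of 4 buckets whose partial results are summed.
tree = SlidingTree(4, lambda a, b: a + b)
print([tree.slide(x) for x in (1, 2, 3, 4)])  # [1, 3, 6, 10]
print(tree.slide(5))     # 14: bucket 1 is dropped, giving 2 + 3 + 4 + 5
print(tree.recomputed)   # 2: only log2(4) internal nodes were recombined
```

In this sketch a slide touches a number of internal nodes equal to the tree depth, whereas recomputing from scratch would combine every bucket in the window again. Handling variable-width windows or non-commutative combiners would need a different tree layout than the circular one assumed here.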
