Slider: incremental sliding window analytics

Sliding window analytics is often used in distributed data-parallel computing for analyzing large streams of continuously arriving data. When pairs of consecutive windows overlap, there is a potential to update the output incrementally, more efficiently than recomputing from scratch. However, in most systems, realizing this potential requires programmers to explicitly manage the intermediate state for overlapping windows, and devise an application-specific algorithm to incrementally update the output. In this paper, we present self-adjusting contraction trees, a set of data structures and algorithms for transparently updating the output of a sliding window computation as the window moves, while reusing, to the extent possible, results from prior computations. Self-adjusting contraction trees structure sub-computations of a data-parallel computation in the form of a shallow (logarithmic depth) balanced data dependence graph, through which input changes are efficiently propagated in asymptotically sub-linear time. We implemented self-adjusting contraction trees in a system called Slider. The design of Slider incorporates several novel techniques, most notably: (i) a set of self balancing trees tuned for different variants of sliding window computation (append-only, fixed-width, or variable-width slides); (ii) a split processing mode, where a background pre-processing stage leverages the predictability of input changes to pave the way for a more efficient foreground processing when the window slides; and (iii) an extension of the data structures to handle multiple-job workflows such as data-flow query processing. We evaluated Slider using a variety of applications and real-world case studies. Our results show significant performance gains without requiring any changes to the existing application code used for non-incremental data processing.

[1]  Pramod Bhatotia,et al.  Large-scale Incremental Data Processing with Change Propagation , 2011, HotCloud.

[2]  Christopher Olston,et al.  Stateful bulk processing for incremental analytics , 2010, SoCC '10.

[3]  Ying Li,et al.  Microsoft CEP Server and Online Behavioral Targeting , 2009, Proc. VLDB Endow..

[4]  Michael Stonebraker,et al.  Fault-tolerance in the Borealis distributed stream processing system , 2005, SIGMOD '05.

[5]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[6]  Sartaj Sahni,et al.  Handbook of Data Structures and Applications , 2004 .

[7]  Umut A. Acar Self-adjusting computation: (an overview) , 2009, PEPM '09.

[8]  Lenin Ravindranath,et al.  Nectar: Automatic Management of Data and Computation in Datacenters , 2010, OSDI.

[9]  Guy E. Blelloch,et al.  An experimental analysis of self-adjusting computation , 2009 .

[10]  Pramod Bhatotia,et al.  Incoop: MapReduce for incremental computations , 2011, SoCC.

[11]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[12]  Guy E. Blelloch,et al.  The data locality of work stealing , 2000, SPAA.

[13]  Krishna P. Gummadi,et al.  On word-of-mouth based discovery of the web , 2011, IMC '11.

[14]  Chao Tian,et al.  Nova: continuous Pig/Hadoop workflows , 2011, SIGMOD '11.

[15]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[16]  Haifeng Jiang,et al.  Photon: fault-tolerant and scalable joining of continuous data streams , 2013, SIGMOD '13.

[17]  Marcel Dischinger,et al.  Glasnost: Enabling End Users to Detect Traffic Differentiation , 2010, NSDI.

[18]  Michael D. Ernst,et al.  HaLoop , 2010, Proc. VLDB Endow..

[19]  Thomas W. Reps,et al.  A categorized bibliography on incremental computation , 1993, POPL '93.

[20]  Frank Dabek,et al.  Large-scale Incremental Processing Using Distributed Transactions and Notifications , 2010, OSDI.

[21]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[22]  Pramod Bhatotia,et al.  Asymptotic Analysis of Self-Adjusting Contraction Trees , 2016, ArXiv.

[23]  Zhengping Qian,et al.  TimeStream: reliable stream computation in the cloud , 2013, EuroSys '13.

[24]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[25]  Akshat Verma,et al.  Shredder: GPU-accelerated incremental storage and computation , 2012, FAST.

[26]  William Pugh,et al.  Skip Lists: A Probabilistic Alternative to Balanced Trees , 1989, WADS.

[27]  Robert Harper Self-adjusting computation , 2004, LICS 2004.

[28]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[29]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[30]  Srikanth Kandula,et al.  PACMan: Coordinated Memory Caching for Parallel Jobs , 2012, NSDI.

[31]  Bingsheng He,et al.  Comet: batched stream processing for data intensive distributed computing , 2010, SoCC '10.

[32]  Guy E. Blelloch,et al.  Traceable data types for self-adjusting computation , 2010, PLDI '10.

[33]  Michael Isard,et al.  DryadInc: Reusing Work in Large-scale Computations , 2009, HotCloud.

[34]  Roberto Tamassia,et al.  Dynamic algorithms in computational geometry , 1992, Proc. IEEE.

[35]  Eric A. Brewer,et al.  Highly available, fault-tolerant, parallel dataflows , 2004, SIGMOD '04.

[36]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[37]  Andreas Haeberlen,et al.  Reliable Client Accounting for P2P-Infrastructure Hybrids , 2012, NSDI.

[38]  Michael Stonebraker,et al.  Fault-tolerance in the borealis distributed stream processing system , 2008, ACM Trans. Database Syst..