Efficient Incremental Data Analysis

Many data-intensive applications require real-time analytics over streaming data. In a growing number of domains -- sensor network monitoring, social web applications, clickstream analysis, high-frequency algorithmic trading, and fraud detection, to name a few -- applications continuously monitor stream events to promptly react to certain data conditions. These applications demand responsive analytics even when faced with high volume and velocity of incoming changes, large numbers of users, and complex processing requirements. Developing a suitable online analytics engine that meets these requirements is challenging. In this thesis, we study techniques for efficient online processing of complex analytical queries, ranging from standard database queries to complex machine learning and digital signal processing workflows. First, we focus on the problem of efficient incremental computation for database queries. We have developed a system, called DBToaster, that compiles declarative queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. At the heart of our system is a recursive query compilation algorithm that materializes a set of supporting higher-order delta views to achieve a substantially lower view maintenance cost. We study the trade-offs between single-tuple and batch incremental processing in local execution, and we present a novel approach for compiling view maintenance code into data-parallel programs optimized for distributed execution. DBToaster supports millions of complete view refreshes per second for a broad range of queries and outperforms commercial database and stream engines by orders of magnitude. We also study incremental computation for queries written as iterative linear algebra, which can capture many machine learning and scientific calculations.
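As a rough illustration of the higher-order delta views mentioned above for DBToaster, the Python sketch below maintains one join-aggregate query by hand. It is not DBToaster's generated code: the query, the class `JoinSumView`, and all method names are hypothetical, chosen only to show the idea. The view is Q = SUM(r.a * s.c) over R(a, b) joined with S(b, c) on b. The delta of Q for an update to R is itself a query over S (a sum of c grouped by b), so that delta query is materialized as an auxiliary view and maintained incrementally in turn. Each update then costs a constant number of operations instead of a join.

```python
from collections import defaultdict


class JoinSumView:
    """Hypothetical sketch: maintains Q = SUM(r.a * s.c) over R(a, b) JOIN S(b, c) ON b."""

    def __init__(self):
        self.q = 0
        # Auxiliary (first-order delta) views; their own deltas are constants.
        self.sum_a_by_b = defaultdict(int)  # SUM(a) GROUP BY b over R
        self.sum_c_by_b = defaultdict(int)  # SUM(c) GROUP BY b over S

    def update_r(self, a, b, mult=1):
        """Apply an insert (mult=1) or delete (mult=-1) of tuple (a, b) in R."""
        self.q += mult * a * self.sum_c_by_b[b]  # delta of Q w.r.t. R
        self.sum_a_by_b[b] += mult * a           # keep the auxiliary view fresh

    def update_s(self, b, c, mult=1):
        """Apply an insert (mult=1) or delete (mult=-1) of tuple (b, c) in S."""
        self.q += mult * c * self.sum_a_by_b[b]  # delta of Q w.r.t. S
        self.sum_c_by_b[b] += mult * c


def recompute(r_rows, s_rows):
    """Reference answer by full re-evaluation, used only for checking."""
    return sum(a * c for (a, b) in r_rows for (b2, c) in s_rows if b == b2)


if __name__ == "__main__":
    view = JoinSumView()
    r_rows = [(2, 1), (3, 1), (5, 2)]
    s_rows = [(1, 10), (2, 7), (1, 4)]
    for a, b in r_rows:
        view.update_r(a, b)
    for b, c in s_rows:
        view.update_s(b, c)
    print(view.q, recompute(r_rows, s_rows))
```

The two printed values agree, which is the invariant such a scheme has to preserve: the incrementally maintained result always equals what full re-evaluation would return.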
We have developed a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change and make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our last research question concerns the integration of general-purpose query processors and domain-specific operations to enable deep data exploration in both online and offline analysis. We advocate a deep integration of signal processing operations and general-purpose query processors. We demonstrate that in-situ processing of tempo-relational and signal data through a unified query language empowers users to express end-to-end workflows more succinctly inside one system while at the same time offering orders of magnitude better performance than existing popular data management systems.
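One way to read the factorization idea described above for LINVIEW is the following small Python sketch. It is an illustration under stated assumptions, not the framework's implementation: the view is B = A·A, the input change is a rank-1 update A' = A + u vᵀ, and the helper names (`square_delta_factors`, `add_outer`, and so on) are hypothetical. Expanding (A + u vᵀ)² − A² gives u (vᵀA) + (A u + (v·u) u) vᵀ, so the delta touches every entry of B yet can be represented by two pairs of vectors. Computing those vectors needs only matrix-vector products, O(n²) work, whereas recomputing the product from scratch is O(n³) with the standard algorithm.

```python
def mat_mul(a, b):
    """Plain O(n^3) matrix product, used for re-evaluation and checking."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_vec(a, x):
    """Matrix-vector product A x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]


def vec_mat(x, a):
    """Row-vector times matrix, x^T A."""
    return [sum(x[i] * a[i][j] for i in range(len(x))) for j in range(len(a[0]))]


def add_outer(m, p, q):
    """In-place update m += p q^T."""
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            m[i][j] += p_i * q_j


def square_delta_factors(a, u, v):
    """Return [(p1, q1), (p2, q2)] with (A + u v^T)^2 - A^2 = p1 q1^T + p2 q2^T."""
    vt_a = vec_mat(v, a)                         # v^T A
    a_u = mat_vec(a, u)                          # A u
    v_dot_u = sum(x * y for x, y in zip(v, u))   # scalar v . u
    p2 = [a_u[i] + v_dot_u * u[i] for i in range(len(u))]
    return [(u, vt_a), (p2, v)]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    u, v = [1, 0], [0, 1]          # adds 1 to entry A[0][1]
    b = mat_mul(a, a)              # materialized view B = A A
    for p, q in square_delta_factors(a, u, v):
        add_outer(b, p, q)         # apply the factored delta to the view
    a_new = [[1, 3], [3, 4]]
    print(b == mat_mul(a_new, a_new))
```

Keeping the delta in factored form, rather than as a dense matrix, is what lets the update be propagated through later operations cheaply; the sketch above only applies it to a single product.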
