SnappyData: Streaming, Transactions, and Interactive Analytics in a Unified Engine

In recent years, our customers have expressed frustration with the traditional approach of combining disparate products to handle their streaming, transactional, and analytical needs. The common practice of stitching heterogeneous environments together in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership. With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics that delivers stream analytics, OLTP, and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics). After presenting a few use case scenarios, we carefully study the challenges involved in marrying these two systems, which have drastically different design philosophies: Spark is a computational model designed for high-throughput analytics, whereas GemFire is a transactional engine designed for low-latency operations. Moreover, we find that even in-memory solutions are often incapable of delivering truly interactive analytics (i.e., responses within a couple of seconds) when faced with large data volumes or high-velocity streams. SnappyData therefore combines state-of-the-art approximate query processing techniques with a variety of data synopses to ensure interactive analytics over both streaming and stored data. Through a novel concept of high-level accuracy contracts (HAC), SnappyData is the first to offer end users an intuitive means of expressing their accuracy requirements without overwhelming them with statistical concepts.
