SnappyData: A Unified Cluster for Streaming, Transactions and Interactive Analytics

Many modern applications mix streaming, transactional and analytical workloads. Traditional data platforms, however, are each designed to support one specific type of workload. The lack of a single platform that supports all of these workloads has forced users to combine disparate products in custom ways. This common practice of stitching together heterogeneous environments has caused enormous production woes by increasing complexity and the total cost of ownership. To support this class of applications, we present SnappyData as the first unified engine capable of delivering analytics, transactions, and stream processing in a single integrated cluster. We build this hybrid engine by carefully marrying a big data computational engine (Apache Spark) with a scale-out transactional store (Apache GemFire). We study and address the challenges involved in building such a hybrid distributed system from two conflicting components designed on drastically different philosophies: one a lineage-based computational model designed for high-throughput analytics, the other a consensus- and replication-based model designed for low-latency operations.
