Apache Flink™: Stream and Batch Processing in a Single Engine

Modern enterprise applications are shifting from purely transactional processing to combined analytical and transactional processing. Combining these two opposing query types in a single database management system places additional requirements on transaction management. In this paper, we discuss our approach to achieving high throughput for transactional query processing while allowing concurrent analytical queries. We present our approach to distributed snapshot isolation and optimized two-phase commit protocols.
