Kafka, Samza and the Unix Philosophy of Distributed Data

Apache Kafka is a scalable message broker, and Apache Samza is a stream processing framework built upon Kafka. They are widely used as infrastructure for implementing personalized online services and real-time predictive analytics. Besides providing high throughput and low latency, Kafka and Samza are designed with operational robustness and long-term maintenance of applications in mind. In this paper we explain the reasoning behind the design of Kafka and Samza, which allow complex applications to be built by composing a small number of simple primitives – replicated logs and stream operators. We draw parallels between the design of Kafka and Samza, batch processing pipelines, database architecture, and the design philosophy of Unix.
