Kafka, Samza and the Unix Philosophy of Distributed Data

Apache Kafka is a scalable message broker, and Apache Samza is a stream processing framework built upon Kafka. They are widely used as infrastructure for implementing personalized online services and real-time predictive analytics. Besides providing high throughput and low latency, Kafka and Samza are designed with operational robustness and long-term maintenance of applications in mind. In this paper we explain the reasoning behind the design of Kafka and Samza, which allow complex applications to be built by composing a small number of simple primitives – replicated logs and stream operators. We draw parallels between the design of Kafka and Samza, batch processing pipelines, database architecture, and the design philosophy of Unix.

Martin Kleppmann | Jay Kreps

[1] Jay Kreps, et al. Kafka: a Distributed Messaging System for Log Processing, 2011.

[2] Mendel Rosenblum, et al. The design and implementation of a log-structured file system, 1991, SOSP '91.

[3] Tao Zou, et al. Tango: distributed data structures over a shared log, 2013, SOSP.

[4] Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[5] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[6] Henrik Loeser, et al. "One Size Fits All": An Idea Whose Time Has Come and Gone?, 2011, BTW.

[7] Pat Helland, et al. Immutability Changes Everything, 2015, CIDR.

[8] Margo I. Seltzer, et al. Beyond Relational Databases, 2005, ACM Queue.

[9] Sam Shah, et al. The big data ecosystem at LinkedIn, 2013, SIGMOD '13.

[10] B. A. Tague, et al. UNIX time-sharing system: Foreword, 1978, The Bell System Technical Journal.

[11] Ken Thompson, et al. The UNIX time-sharing system, 1974, CACM.

[12] Herodotos Herodotou, et al. Massively Parallel Databases and MapReduce Systems, 2013, Found. Trends Databases.

[13] Jun Rao, et al. Building a Replicated Logging System with Apache Kafka, 2015, Proc. VLDB Endow.

[14] Jun Rao, et al. Liquid: Unifying Nearline and Offline Big Data Integration, 2015, CIDR.

[15] Jun Rao, et al. Building LinkedIn's Real-time Activity Data Pipeline, 2012, IEEE Data Eng. Bull.

[16] Christian Posse, et al. Metaphor: a system for related search recommendations, 2012, CIKM.

[17] Christian Posse, et al. The Browsemaps: Collaborative Filtering at LinkedIn, 2014, RSWeb@RecSys.

[18] Craig Chambers, et al. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, 2015, Proc. VLDB Endow.

[19] Brian W. Kernighan, et al. Program design in the UNIX environment, 2007.

[20] Patrick E. O'Neil, et al. The log-structured merge-tree (LSM-tree), 1996, Acta Informatica.

[21] Dennis M. Ritchie. The UNIX Time-sharing System: A Retrospective, 1977.