Paper Information - AdBench: A Complete Benchmark for Modern Data Pipelines

AdBench: A Complete Benchmark for Modern Data Pipelines

Since the introduction of Apache YARN, which modularly separated resource management and scheduling from the distributed programming frameworks, a multitude of YARN-native computation frameworks have been developed, each specializing in a specific variant of analytics. In addition to traditional batch-oriented computations (e.g., MapReduce, Apache Hive [14], and Apache Pig [18]), the Apache Hadoop ecosystem now contains streaming analytics frameworks (e.g., Apache Apex [8]), MPP SQL engines (e.g., Apache Trafodion [20], Apache Impala [15], and Apache HAWQ [12]), OLAP cubing frameworks (e.g., Apache Kylin [17]), frameworks suitable for iterative machine learning (e.g., Apache Spark [19] and Apache Flink [10]), and graph processing frameworks (e.g., GraphX). With the emergence of the Hadoop Distributed File System and its various implementations as the preferred method of constructing a data lake, end-to-end data pipelines are increasingly being built on Hadoop-based data lake platforms.

Milind Bhandarkar

[1] Tilmann Rabl, et al. Benchmarking Big Data Systems and the BigData Top100 List, 2013, Big Data.

[2] Karl Huppler, et al. TPC Express - A New Path for TPC Benchmarks, 2013, TPCTC.

[3] Carlo Curino, et al. Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data, 2014, TPCTC.