Paper information - Composable and efficient functional big data processing framework

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, jobs in these frameworks are roughly defined and packaged as executable jars, with none of their functionality exposed or described. As a result, deployed jobs are not natively composable or reusable for subsequent development. This also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the Hierarchically Distributed Data Matrix (HDM), a functional, strongly-typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that our optimizations achieve improvements of between 10% and 60% in Job-Completion-Time for different types of operation sequences when compared with the current state of the art, Apache Spark.

Dongyao Wu | Sherif Sakr | Liming Zhu | Qinghua Lu
