CHOPPER: Optimizing Data Partitioning for In-memory Data Analytics Frameworks

The performance of in-memory data analytics frameworks such as Spark is significantly affected by how data is partitioned, because the partitioning effectively determines task granularity and parallelism. Moreover, different phases of a workload's execution can have different optimal partitionings. However, in current implementations, the tuning knobs that control partitioning are either configured statically or require a cumbersome programmatic process to effect changes at runtime. In this paper, we propose CHOPPER, a system that automatically determines the optimal number of partitions for each phase of a workload and dynamically changes the partitioning scheme during workload execution. CHOPPER monitors task execution and DAG scheduling information to determine the optimal level of parallelism. It repartitions data as needed to ensure efficient task granularity, avoid data skew, and reduce shuffle traffic. Thus, CHOPPER allows users to write applications without hand-tuning for optimal parallelism. Experimental results show that CHOPPER improves workload performance by up to 35.2% compared to a standard Spark setup.
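
To make the link between partition count, task granularity, and parallelism concrete, the following is a minimal sketch of a toy cost model. It is an illustration only and is not CHOPPER's actual model: the function names, the fixed per-task overhead, the assumption of evenly divisible work (no skew), and the execution of tasks in full waves across the available cores are all assumptions introduced here.

```python
import math


def stage_time(num_partitions, total_work_s, per_task_overhead_s, cores):
    """Estimate the duration of one stage under a toy model (an assumption).

    Each partition becomes one task. Work is split evenly across tasks,
    every task pays a fixed scheduling/launch overhead, and tasks run in
    waves of at most `cores` tasks at a time.
    """
    task_time = total_work_s / num_partitions + per_task_overhead_s
    waves = math.ceil(num_partitions / cores)
    return waves * task_time


def best_partition_count(total_work_s, per_task_overhead_s, cores, candidates):
    """Return the candidate partition count with the lowest estimated time."""
    return min(
        candidates,
        key=lambda p: stage_time(p, total_work_s, per_task_overhead_s, cores),
    )


if __name__ == "__main__":
    # Hypothetical stage: 640 s of total work, 0.5 s overhead per task, 16 cores.
    for p in [4, 16, 64, 256]:
        print(p, stage_time(p, 640.0, 0.5, 16))
    print("best:", best_partition_count(640.0, 0.5, 16, [4, 16, 64, 256]))
```

In this sketch, too few partitions leave cores idle, while too many partitions accumulate per-task overhead. A real system would also have to account for data skew and shuffle cost, which this toy model deliberately ignores.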
