Towards Seamless Configuration Tuning of Big Data Analytics

Running distributed data processing workloads (such as those built on Hadoop or Spark) in cloud environments presents a unique opportunity to explore trade-offs among elasticity (including the types of resources allocated), overall runtime, and total cost. However, beyond setting high-level constraints and objectives, it is the cloud providers, not the end-users, who should be mainly concerned with these optimizations. Providers have the vantage point to collect actionable information, the economies of scale, and the position to adjust parameters when conditions change, so as to fulfil SLOs that go beyond the classic measures of latency and throughput. This is at odds with the existing approach of making software (including the interfaces to the cloud and the processing frameworks) as configurable as possible. We propose that self-tunability (or the illusion of it, as far as the end-user is concerned) is a better long-term goal than configurability.
