Data and container placement in scalable data analytics platforms

Distributed dataflow systems process large volumes of data in parallel on multiple machines. In production, multiple dataflow applications are scheduled for execution in virtual containers on a per-job basis, and they access datasets that are partitioned into datablocks across the disks of the cluster machines. Runtime performance is important for many of these jobs, as their users expect fast results. However, optimizing performance is difficult, because dataflow jobs are very diverse and are used in a wide variety of domains such as relational processing, machine learning, and graph processing. Container and datablock placement decisions significantly impact a job’s runtime performance. Moreover, changing placements affects runtime performance without modifying the application’s code, and can thus be applied to many jobs without much configuration effort on the user’s side. However, jobs benefit differently from placement decisions, because their resource demands differ. Hence, no single placement strategy is optimal for all jobs. In addition, users require secure long-term retention of their documents and datasets.
This thesis presents container and datablock placement strategies that optimize the runtime performance of distributed dataflow applications running on shared data analytics platforms. It contributes two placement methods. The first method improves the efficiency of a job’s dataflow operations and its degree of data locality by colocating the job’s input datablocks and containers on a selected set of nodes. The second method places a job’s containers based on the network distances between the containers and the job’s input datablocks, as well as on container interference. In addition, this thesis explores the problem of data retention in shared data analytics platforms. To this end, it contributes a method for storing and accessing lineage metadata through smart contracts executed on a decentralized blockchain network.
The methods presented in this thesis have been implemented in a research prototype that has been integrated with Hadoop and Ethereum. For evaluation, we used a 64-node commodity cluster and workloads consisting of Flink applications from the domains of relational processing, machine learning, and graph processing. We compared the runtime performance of workloads scheduled with our methods against that of workloads scheduled with Hadoop’s default placement method. For our blockchain-based data retention method, we measured the overhead in terms of additional response time and reported the costs of using it on Ethereum’s blockchain network.
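
To make the idea behind the second placement method more concrete, the following Python sketch scores a candidate container placement by combining a network-distance term with an interference term, and then places containers greedily. It is an illustration only, not the thesis’s actual algorithm: the function names, the hop-count distance, the pairwise interference penalty, the `weight` parameter, and the greedy search are all assumptions made for this example.

```python
from itertools import combinations


def placement_cost(placement, block_nodes, distance, interference, weight=1.0):
    """Score a candidate placement; lower is better (hypothetical cost model).

    placement:    dict mapping container name -> node name
    block_nodes:  nodes that hold the job's input datablocks
    distance:     function (node_a, node_b) -> network distance
    interference: function (container_a, container_b) -> penalty if colocated
    """
    # Network term: distance from each container to its nearest input datablock.
    network = sum(
        min(distance(node, block) for block in block_nodes)
        for node in placement.values()
    )
    # Interference term: penalty for every pair of containers sharing a node.
    clash = sum(
        interference(a, b)
        for a, b in combinations(placement, 2)
        if placement[a] == placement[b]
    )
    return network + weight * clash


def greedy_place(containers, nodes, block_nodes, distance, interference, weight=1.0):
    """Place containers one by one on the node that currently minimizes the cost."""
    placement = {}
    for container in containers:
        placement[container] = min(
            nodes,
            key=lambda n: placement_cost(
                {**placement, container: n},
                block_nodes, distance, interference, weight,
            ),
        )
    return placement


def rack_distance(a, b):
    """Assumed toy topology: node names start with a two-character rack id."""
    if a == b:
        return 0
    return 2 if a[:2] == b[:2] else 4


nodes = ["r1n1", "r1n2", "r2n1"]
result = greedy_place(
    containers=["c1", "c2"],
    nodes=nodes,
    block_nodes=["r1n1"],
    distance=rack_distance,
    interference=lambda a, b: 3,  # assumed constant colocation penalty
)
print(result)
```

In this toy setup, the first container is placed on the node that holds the datablock, while the second one is moved to a neighbouring node in the same rack because the assumed colocation penalty outweighs the extra network distance. Changing `weight` shifts the balance between locality and interference.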
