INSTalytics

We present the design, implementation, and evaluation of Instalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big data analytics in large-scale data centers. Instalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, Instalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, Instalytics uses compute-awareness to customize the 3-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables Instalytics to preserve the same recovery cost and availability as traditional replication. Instalytics also uses compute-awareness to expose a new {\em sliced-read} API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have implemented Instalytics in a production analytics stack, and show that recovery performance and availability is similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.

[1]  Kannan Ramchandran,et al.  A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers , 2014 .

[2]  Michael J. Freedman,et al.  Riffle: optimized shuffle service for large-scale data analytics , 2018, EuroSys.

[3]  Jorge-Arnulfo Quiané-Ruiz,et al.  Only Aggressive Elephants are Fast Elephants , 2012, Proc. VLDB Endow..

[4]  Andrea C. Arpaci-Dusseau,et al.  Semantically-Smart Disk Systems , 2003, FAST.

[5]  Yuanyuan Tian,et al.  CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop , 2011, Proc. VLDB Endow..

[6]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[7]  Vinay Setty,et al.  Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing) , 2010, Proc. VLDB Endow..

[8]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[9]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[10]  Andrea C. Arpaci-Dusseau,et al.  Information and control in gray-box systems , 2001, SOSP.

[11]  GhemawatSanjay,et al.  The Google file system , 2003 .

[12]  Min Zhu,et al.  B4: experience with a globally-deployed software defined wan , 2013, SIGCOMM.

[13]  Jehoshua Bruck,et al.  EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures , 1995, IEEE Trans. Computers.

[14]  Wei Lin,et al.  Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing , 2014, OSDI.

[15]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[16]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[17]  Ittai Abraham,et al.  Replex: A Scalable, Highly Available Multi-Index Data Store , 2016, USENIX Annual Technical Conference.

[18]  Jingren Zhou,et al.  Incorporating partitioning and parallel plans into the SCOPE optimizer , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[19]  Samuel Madden,et al.  A robust partitioning scheme for ad-hoc query workloads , 2017, SoCC.

[20]  Nicolas Bruno,et al.  SCOPE: parallel databases meet MapReduce , 2012, The VLDB Journal.

[21]  Chris Douglas,et al.  Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics , 2017, SIGMOD Conference.

[22]  David J. DeWitt,et al.  Chained declustering: a new availability strategy for multiprocessor database machines , 1990, [1990] Proceedings. Sixth International Conference on Data Engineering.

[23]  Joseph K. Bradley,et al.  Spark SQL: Relational Data Processing in Spark , 2015, SIGMOD Conference.

[24]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[25]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[26]  Van-Anh Truong,et al.  Availability in Globally Distributed Storage Systems , 2010, OSDI.

[27]  Reynold Xin,et al.  Apache Spark , 2016 .

[28]  Christopher Olston,et al.  Interactive Analysis of Web-Scale Data , 2009, CIDR.

[29]  Liwen Sun,et al.  A Partitioning Framework for Aggressive Data Skipping , 2014, Proc. VLDB Endow..

[30]  Chandramohan A. Thekkath,et al.  Petal: distributed virtual disks , 1996, ASPLOS VII.

[31]  Willy Zwaenepoel,et al.  Rock you like a hurricane: taming skew in large scale analytics , 2018, EuroSys.

[32]  Carlo Curino,et al.  Hydra: a federated resource manager for data-center scale analytics , 2019, NSDI.

[33]  Yogen K. Dalal,et al.  Pilot: an operating system for a personal computer , 1980, CACM.

[34]  Chandramohan A. Thekkath,et al.  Frangipani: a scalable distributed file system , 1997, SOSP.