Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analyses require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.
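To make the idea of cost-driven partitioning concrete, the following is a minimal sketch in Python and not the SkewReduce optimizer itself. It assumes one-dimensional points, a hypothetical user-defined cost function (`pairwise_cost`, which charges for every pair of nearby points, so runtime depends on density rather than only on input size), and a simple greedy heuristic (`partition_by_cost`) that repeatedly splits the most expensive partition at its median. All names, the cost model, and the sample data are illustrative assumptions. The baseline, `partition_uniform`, cuts the domain into equal-width cells.

```python
from typing import Callable, List, Sequence

CostFn = Callable[[Sequence[float]], float]


def pairwise_cost(part: Sequence[float], eps: float = 0.05) -> int:
    """Hypothetical cost model: one unit per point plus one per close pair."""
    close_pairs = 0
    for i in range(len(part)):
        for j in range(i + 1, len(part)):
            if abs(part[i] - part[j]) < eps:
                close_pairs += 1
    return len(part) + close_pairs


def partition_uniform(points: Sequence[float], lo: float, hi: float,
                      n: int) -> List[List[float]]:
    """Baseline: n equal-width cells over [lo, hi], ignoring density."""
    width = (hi - lo) / n
    parts: List[List[float]] = [[] for _ in range(n)]
    for p in points:
        idx = min(int((p - lo) / width), n - 1)  # clamp p == hi into last cell
        parts[idx].append(p)
    return parts


def partition_by_cost(points: Sequence[float], cost: CostFn,
                      max_partitions: int) -> List[List[float]]:
    """Greedy heuristic: split the costliest partition at its median."""
    parts = [sorted(points)]
    while len(parts) < max_partitions:
        splittable = [p for p in parts if len(p) > 1]
        if not splittable:
            break
        worst = max(splittable, key=cost)
        parts.remove(worst)
        mid = len(worst) // 2
        parts.extend([worst[:mid], worst[mid:]])
    return parts


def max_cost(parts: List[List[float]], cost: CostFn) -> float:
    """Estimated runtime of the slowest partition (the straggler)."""
    return max(cost(p) for p in parts)


# Illustrative data: a dense cluster near 0 plus sparse points up to 40.
dense = [i * 0.001 for i in range(40)]
sparse = [float(i) for i in range(1, 41)]
points = dense + sparse

uniform_parts = partition_uniform(points, 0.0, 40.0, 4)
cost_parts = partition_by_cost(points, pairwise_cost, 4)
print(max_cost(uniform_parts, pairwise_cost))  # slowest equal-width cell
print(max_cost(cost_parts, pairwise_cost))     # slowest cost-aware partition
```

On this toy input the equal-width baseline places the whole dense cluster in one cell, so its slowest partition has a much higher estimated cost than under the cost-aware split. The gap depends entirely on the assumed cost function and data, and says nothing about the speedups reported in the paper.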
