Twister: a runtime for iterative MapReduce

MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of experience in applying MapReduce to various scientific applications we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes such as Hadoop and DryadLINQ for large scale data parallel applications.

[1]  J. Kruskal Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis , 1964 .

[2]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[3]  Forrest W. Young,et al.  Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features , 1977 .

[4]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[5]  Geoffrey C. Fox,et al.  A deterministic annealing approach to clustering , 1990, Pattern Recognit. Lett..

[6]  Patrick J. F. Groenen,et al.  Modern Multidimensional Scaling: Theory and Applications , 2003 .

[7]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[8]  J. Leeuw Applications of Convex Analysis to Multidimensional Scaling , 2000 .

[9]  J. Jurka Repbase update: a database and an electronic journal of repetitive elements. , 2000, Trends in genetics : TIG.

[10]  Gene H. Golub,et al.  Exploiting the Block Structure of the Web for Computing , 2003 .

[11]  GhemawatSanjay,et al.  The Google file system , 2003 .

[12]  Geoffrey C. Fox,et al.  NaradaBrokering: A Distributed Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids , 2003, Middleware.

[13]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[14]  Shaozhi Ye,et al.  Distributed PageRank computation based on iterative aggregation-disaggregation methods , 2005, CIKM '05.

[15]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[16]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[17]  Gregor von Laszewski,et al.  Swift: Fast, Reliable, Loosely Coupled Parallel Computation , 2007, 2007 IEEE Congress on Services (Services 2007).

[18]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[19]  Geoffrey C. Fox,et al.  Parallel Data Mining from Multicore to Cloudy Grids , 2008, High Performance Computing Workshop.

[20]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[21]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[22]  Geoffrey C. Fox,et al.  DryadLINQ for Scientific Analyses , 2009, 2009 Fifth IEEE International Conference on e-Science.

[23]  Robert L. Grossman,et al.  Sector and Sphere: the design and implementation of a high-performance data cloud , 2009, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[24]  Geoffrey C. Fox,et al.  High Performance Parallel Computing with Clouds and Cloud Technologies , 2009, CloudComp.

[25]  Douglas Thain,et al.  All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids , 2010, IEEE Transactions on Parallel and Distributed Systems.