Applying Performance Models to Understand Data-Intensive Computing Efficiency

Abstract: New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the optimal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3-13x longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. As a sanity check on the model, we present small-scale experiments with Hadoop and a simplified dataflow processing tool called Parallel DataSeries. Parallel DataSeries achieves performance close to the analytic optimum, showing that the model is realistic and that large improvements in the efficiency of parallel analytics are possible.
