Paper Information - Applying Simple Performance Models to Understand Inefficiencies in Data-Intensive Computing

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the theoretic ideal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3–13× longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model’s ideal can be approached (i.e., that it is not wildly optimistic), coming within 10–14% of the model’s prediction. Moreover, guided by the model, we present analysis of inefficiencies which exposes issues in both the disk and networking subsystems that will be faced by any DISC system built atop standard OS and networking services.

Acknowledgements: We thank the members and companies of the PDL Consortium (including APC, EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel, LSI, Microsoft Research, NEC Laboratories, NetApp, Oracle, Riverbed, Samsung, Seagate, STEC, Symantec, VMware, and Yahoo! Labs) for their interest, insights, feedback, and support. This research was sponsored in part by an HP Innovation Research Award and by CyLab at Carnegie Mellon University under grant DAAD19–02–1–0389 from the Army Research Office.
