Understanding Inefficiencies in Data-Intensive Computing

Abstract: New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model's ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems. These are issues that any DISC system built atop popular commodity hardware and OSs will face.
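The abstract does not give the model's equations, so the following is only a minimal sketch of how an I/O-bound lower-bound estimate of this kind might be computed; it is not the paper's formulation. It assumes a job can be split into phases, that each node moves a known number of bytes through its disk and its network link in each phase, and that disk and network transfers overlap perfectly within a phase. The names `Phase`, `ideal_runtime`, and `inefficiency`, and all bandwidth figures, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    # Hypothetical per-node I/O volume for one phase of a dataflow.
    disk_bytes: float  # bytes read plus written on the node's local disk
    net_bytes: float   # bytes sent or received over the node's network link


def ideal_runtime(phases, disk_bw, net_bw):
    """Lower-bound runtime in seconds under the stated assumptions.

    Each phase is limited by whichever resource takes longer, assuming
    disk and network I/O fully overlap; phases run one after another.
    Bandwidths are in bytes per second.
    """
    return sum(max(p.disk_bytes / disk_bw, p.net_bytes / net_bw) for p in phases)


def inefficiency(actual_seconds, ideal_seconds):
    """Fraction by which a measured runtime exceeds the ideal estimate."""
    return actual_seconds / ideal_seconds - 1.0


# Illustrative numbers only (assumed, not taken from the paper).
phases = [
    Phase(disk_bytes=8e9, net_bytes=4e9),  # e.g. read input and shuffle
    Phase(disk_bytes=4e9, net_bytes=0.0),  # e.g. write output locally
]
ideal = ideal_runtime(phases, disk_bw=100e6, net_bw=110e6)
gap = inefficiency(actual_seconds=144.0, ideal_seconds=ideal)
print(f"ideal = {ideal:.1f} s, gap = {gap:.0%}")
```

With these assumed inputs both phases are disk-bound, and a measured runtime of 144 s against the estimate corresponds to a gap of roughly 20%, the same order as the gap the abstract reports.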
