A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns

A huge increase in data storage and processing requirements has lead to Big Data, for which next generation storage systems are being designed and implemented. However, we have a limited understanding of the workloads of Big Data storage systems. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large Hadoop clusters at Yahoo! and characterize the file popularity, temporal locality, and arrival patterns of the workloads. We identify several interesting properties and compare them with previous observations from web and media server workloads. To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer.

[1]  Li Fan,et al.  Web caching and Zipf-like distributions: evidence and implications , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[2]  R. Katz,et al.  Interactive Query Processing in Big Data Systems: A Cross Industry Study of MapReduce Workloads , 2012 .

[3]  Hui Li,et al.  Towards A Better Understanding of Workload Dynamics on Data-Intensive Clusters and Grids , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[4]  Cristina L. Abad,et al.  DARE: Adaptive Data Replication for Efficient Cluster Scheduling , 2011, 2011 IEEE International Conference on Cluster Computing.

[5]  Azer Bestavros,et al.  Self-similarity in World Wide Web traffic: evidence and possible causes , 1996, SIGMETRICS '96.

[6]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[7]  Evgenia Smirni,et al.  Trace data characterization and fitting for Markov modeling , 2010, Performance evaluation (Print).

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Ludmila Cherkasova,et al.  Analysis of enterprise media server workloads: access patterns, locality, content evolution, and rates of change , 2004, IEEE/ACM Transactions on Networking.

[10]  Yanpei Chen,et al.  Design implications for enterprise storage systems via multi-dimensional trace analysis , 2011, SOSP '11.

[11]  Murad S. Taqqu,et al.  On the Self-Similar Nature of Ethernet Traffic , 1993, SIGCOMM.

[12]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[13]  Garth A. Gibson,et al.  DiskReduce: RAID for data-intensive scalable computing , 2009, PDSW '09.

[14]  Mark E. J. Newman,et al.  Power-Law Distributions in Empirical Data , 2007, SIAM Rev..

[15]  Walter Willinger,et al.  On the self-similar nature of Ethernet traffic , 1993, SIGCOMM '93.

[16]  Lin Xiao,et al.  In Search of an API for Scalable File Systems: Under the Table or Above It? , 2009, HotCloud.

[17]  Kihong Park,et al.  On the relationship between file sizes, transport protocols, and self-similar network traffic , 1996, Proceedings of 1996 International Conference on Network Protocols (ICNP-96).

[18]  Cristina L. Abad,et al.  Metadata Traces and Workload Models for Evaluating Big Storage Systems , 2012, 2012 IEEE Fifth International Conference on Utility and Cloud Computing.

[19]  Yanpei Chen,et al.  Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads , 2012, Proc. VLDB Endow..

[20]  Archana Ganapathi,et al.  The Case for Evaluating MapReduce Performance Using Workload Suites , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.