Predictive data and energy management in GreenHDFS

The sheer scale and rapid growth of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., the absolute file path) in which the files are organized. Leveraging this observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop Distributed File System that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes. The learned correlation guides novel predictive file zone placement, migration, and replication policies that significantly outperform current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show that predictive GreenHDFS achieves a much better trade-off between performance and energy consumption.
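To make the idea of path-based prediction concrete, the following is a minimal sketch and not the learner or the policies used in GreenHDFS. It assumes a deliberately simple supervised model: the predicted heat of a new file is the mean heat of previously observed files under the deepest matching directory prefix. The class `DirectoryHeatModel`, the helper `choose_zone`, the zone labels `"hot"` and `"cold"`, and the threshold value are all hypothetical names and values introduced for illustration.

```python
from collections import defaultdict
from statistics import mean


def path_prefixes(path):
    """Yield the directory prefixes of an absolute file path, deepest first."""
    parts = [p for p in path.split("/") if p][:-1]  # drop the file name
    for depth in range(len(parts), 0, -1):
        yield "/" + "/".join(parts[:depth])
    yield "/"  # the root directory always matches


class DirectoryHeatModel:
    """Hypothetical learner: mean observed heat per directory prefix."""

    def __init__(self):
        self._samples = defaultdict(list)

    def fit(self, observations):
        """observations: iterable of (absolute_path, heat) pairs from a trace."""
        for path, heat in observations:
            for prefix in path_prefixes(path):
                self._samples[prefix].append(heat)
        return self

    def predict(self, path):
        """Predict heat from the deepest directory prefix seen in training."""
        for prefix in path_prefixes(path):
            if prefix in self._samples:
                return mean(self._samples[prefix])
        raise ValueError("model has no training data")


def choose_zone(predicted_heat, hot_threshold=10.0):
    """Assumed policy: files predicted to be rarely accessed go to the cold zone."""
    return "hot" if predicted_heat >= hot_threshold else "cold"
```

A trained model can then drive placement at file-creation time, for example `choose_zone(model.predict("/data/logs/2010/c.log"))`. Falling back to shallower prefixes is one way to handle directories that never appeared in the training trace; a real system would likely use a richer regression model and additional attributes such as size and lifespan.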
