I/O Performance Modeling for Big Data Applications over Cloud Infrastructures

Big Data applications receive an ever-increasing amount of attention, thus becoming a dominant class of applications that are deployed over virtualized environments. Cloud environments entail a large amount of complexity relative to I/O performance. The use of Big Data increases the complexity of I/O management as well as its characterization and prediction: As I/O operations become growingly dominant in such applications, the intricacies of virtualization, different storage back ends and deployment setups significantly hinder our ability to analyze and correctly predict I/O performance. To that end, this work proposes an end-to-end modeling technique to predict performance of I/O--intensive Big Data applications running over cloud infrastructures. We develop a model tuned over application and infrastructure dimensions: Primitive I/O operations, data access patterns, storage back ends and deployment parameters. The trained model can be used to predict both I/O but also general task performance. Our evaluation results show that for jobs which are dominated by I/O operations, such as I/O-bound MapReduce jobs, our model is capable of predicting execution time with an accuracy close to 90% that decreases as application processing becomes more complex.

[1]  Devarshi Ghoshal,et al.  I/O performance of virtualized cloud environments , 2011, DataCloud-SC '11.

[2]  Arif Merchant,et al.  A modular, analytical throughput model for modern disk arrays , 2001, MASCOTS 2001, Proceedings Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[3]  E. Anderson HPL – SSP – 2001 – 4 : Simple table-based modeling of storage devices , 2001 .

[4]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[5]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[6]  Herodotos Herodotou,et al.  Profiling, what-if analysis, and cost-based optimization of MapReduce programs , 2011, Proc. VLDB Endow..

[7]  S.A. Brandt,et al.  CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[8]  Peter Kilpatrick,et al.  Performance models of storage contention in cloud environments , 2013, Software & Systems Modeling.

[9]  John Shalf,et al.  Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Carlos Maltzahn,et al.  RADOS: a scalable, reliable storage service for petabyte-scale storage clusters , 2007, PDSW '07.

[11]  Christina Delimitrou,et al.  Decoupling datacenter studies from access to large-scale applications: A modeling approach for storage workloads , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[12]  John Wilkes,et al.  An introduction to disk drive modeling , 1994, Computer.

[13]  Sally A. McKee,et al.  An Approach to Performance Prediction for Parallel Applications , 2005, Euro-Par.

[14]  Christos Faloutsos,et al.  Storage device performance prediction with CART models , 2004, The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004. (MASCOTS 2004). Proceedings..

[15]  David A. Patterson,et al.  A new approach to I/O performance evaluation: self-scaling I/O benchmarks, predicted I/O performance , 1993, SIGMETRICS '93.

[16]  Ajay Gulati,et al.  Storage Workload Characterization and Consolidation in Virtualized Environments , 2008 .