Paper Information - I/O Performance Modeling for Big Data Applications over Cloud Infrastructures

I/O Performance Modeling for Big Data Applications over Cloud Infrastructures

Big Data applications receive an ever-increasing amount of attention and are becoming a dominant class of applications deployed over virtualized environments. Cloud environments entail a large amount of complexity with respect to I/O performance. The use of Big Data increases the complexity of I/O management as well as of its characterization and prediction: as I/O operations become increasingly dominant in such applications, the intricacies of virtualization, different storage back ends and deployment setups significantly hinder our ability to analyze and correctly predict I/O performance. To that end, this work proposes an end-to-end modeling technique to predict the performance of I/O-intensive Big Data applications running over cloud infrastructures. We develop a model tuned over application and infrastructure dimensions: primitive I/O operations, data access patterns, storage back ends and deployment parameters. The trained model can be used to predict both I/O and general task performance. Our evaluation results show that for jobs dominated by I/O operations, such as I/O-bound MapReduce jobs, our model predicts execution time with an accuracy close to 90%, which decreases as application processing becomes more complex.

Verena Kantere | Dimitrios Tsoumakos | Nectarios Koziris | Anastassios Nanos | Ioannis Mytilinis
