I/O Characteristics Discovery in Cloud Storage Systems

Data growth across the many applications running in clouds poses significant challenges to cloud storage systems. Delivering the best possible storage and I/O performance often requires understanding and leveraging the I/O characteristics of data accesses. A number of research studies have addressed this topic. However, most of them either use a limited number of data-access attributes, which restricts how well the method generalizes across applications, or rely heavily on domain knowledge and expertise about applications' I/O behaviors to select the most representative features, which introduces bias toward certain workloads. To overcome these limitations, this study presents a new I/O characteristic discovery methodology. The method captures as many data-access features as possible to eliminate human bias. It uses a machine-learning-based strategy to derive the most important set of features automatically, and it groups data objects with a clustering algorithm (DBSCAN) to reveal their I/O characteristics. The revealed I/O characteristics can direct I/O performance optimizations in numerous scenarios, such as data prefetching and data reorganization in cloud storage systems.
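The pipeline described above (capture many features, keep the most important ones, cluster data objects with DBSCAN) can be illustrated with a minimal sketch. It is not the implementation used in this study. The feature columns, the sample values, the `eps` and `min_pts` settings, and all function names are hypothetical. The abstract does not specify the feature-selection strategy, so a simple variance ranking is used here purely as a stand-in.

```python
from math import dist

NOISE = -1  # label for objects that belong to no cluster


def min_max_scale(rows):
    """Scale every feature column to [0, 1]; constant columns become 0."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def top_variance_features(rows, k):
    """Indices of the k highest-variance columns (assumed stand-in selector)."""
    cols = list(zip(*rows))

    def variance(col):
        mean = sum(col) / len(col)
        return sum((v - mean) ** 2 for v in col) / len(col)

    ranked = sorted(range(len(cols)), key=lambda i: variance(cols[i]), reverse=True)
    return sorted(ranked[:k])


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the original algorithm.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical per-object features:
# (mean request size in KB, read fraction, sequential fraction, constant flag)
objects = [
    (4, 0.95, 0.10, 1),
    (8, 0.90, 0.15, 1),
    (4, 0.92, 0.12, 1),
    (1024, 0.10, 0.95, 1),
    (1100, 0.05, 0.90, 1),
    (1024, 0.08, 0.92, 1),
    (512, 0.50, 0.50, 1),
]

scaled = min_max_scale(objects)
kept = top_variance_features(scaled, k=3)  # drops the uninformative column
reduced = [[row[i] for i in kept] for row in scaled]
labels = dbscan(reduced, eps=0.3, min_pts=3)
print(kept, labels)  # [0, 1, 2] [0, 0, 0, 1, 1, 1, -1]
```

In this toy input, the small random-read objects and the large sequential-write objects form two clusters, and the mixed object is labeled as noise. A storage system could, for example, apply a different prefetching or layout policy to each cluster.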
