Shared I/O Scheduling in Cloud for Structured Data Processing

The cloud plays an important role in structured data processing because of its high I/O throughput and strong computing capability. As structured data processing grows in importance, the cloud faces increasing pressure from data processing demands. Massive numbers of data query and analysis tasks run on the cloud and issue a very large volume of I/O requests, which creates new problems for I/O scheduling. In this paper, we propose a shared I/O scheduling method. First, the scheduler takes the performance differences among nodes into account: a mathematical model of finish-time prediction estimates the time each node needs to finish its tasks, and each request is assigned to the node with the lowest predicted time cost. Second, to further save I/O resources, we introduce a shared I/O mechanism that merges requests to the same table into a single shared request. This mechanism markedly reduces the number of requests and improves the performance of concurrent data queries by avoiding repeated reads. Finally, we evaluate the method in several experiments. The results indicate that the shared I/O scheduling method effectively saves I/O resources and improves data processing performance, and that it has a wide range of potential applications.
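The two ideas in the abstract, merging requests to the same table and assigning work to the node with the lowest predicted finish time, can be illustrated with a minimal sketch. This is not the paper's model: the abstract does not give the finish-time formula, so the linear estimate below (queued data plus new data, divided by node throughput) is an assumption, as are the names `Node`, `merge_requests`, `schedule`, and the largest-table-first ordering.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    throughput_mb_s: float  # assumed per-node I/O throughput
    queued_mb: float = 0.0  # data already assigned to this node

    def predicted_finish(self, size_mb: float) -> float:
        # Hypothetical linear model: (queued work + new work) / throughput.
        return (self.queued_mb + size_mb) / self.throughput_mb_s


def merge_requests(requests):
    """Group (query_id, table) requests so that each table is read once."""
    shared = defaultdict(list)
    for query_id, table in requests:
        shared[table].append(query_id)
    return dict(shared)


def schedule(requests, table_sizes_mb, nodes):
    """Assign each shared request to the node with the lowest predicted finish time."""
    shared = merge_requests(requests)
    plan = {}
    # Greedy heuristic (an assumption): place the largest tables first.
    for table in sorted(shared, key=lambda t: -table_sizes_mb[t]):
        size = table_sizes_mb[table]
        best = min(nodes, key=lambda n: n.predicted_finish(size))
        best.queued_mb += size
        plan[table] = (best.name, shared[table])
    return plan


if __name__ == "__main__":
    requests = [("q1", "orders"), ("q2", "orders"), ("q3", "users"), ("q4", "orders")]
    sizes = {"orders": 400.0, "users": 100.0}
    nodes = [Node("fast", 200.0), Node("slow", 100.0)]
    for table, (node, queries) in schedule(requests, sizes, nodes).items():
        print(table, "->", node, "serving", queries)
```

In this toy run, four requests collapse into two shared reads, and the second read goes to the slower node because the faster one already has queued work. A real scheduler would need a finish-time model calibrated against measured node performance.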
