Towards An Efficient Cloud Computing System: Data Management, Resource Allocation and Job Scheduling

Cloud computing is an emerging technology in distributed computing, and it has proved to be an effective infrastructure for providing services to users. The cloud continues to develop and faces many challenges. One challenge is to build a cost-effective data management system that ensures high data availability while maintaining consistency. Another is efficient resource allocation that ensures high resource utilization and high service level objective (SLO) availability. A third is scheduling for high throughput, where scheduling refers to the set of policies that control the order in which a computer system performs work. In this dissertation, we study how to manage data and improve data availability while reducing cost (i.e., consistency maintenance cost and storage cost); how to efficiently manage the resources for processing jobs and increase resource utilization with high SLO availability; and how to design an efficient scheduling algorithm that provides high throughput and low overhead while satisfying the demands on job completion time.

Replication is a common approach to enhancing data availability in cloud storage systems. Previously proposed replication schemes cannot effectively handle both correlated and non-correlated machine failures while increasing data availability with limited resources. Schemes designed for correlated machine failures must create a constant number of replicas for each data object, which neglects diverse data popularities and cannot use the available resources to maximize the expected data availability. The previous schemes also neglect the consistency maintenance cost and the storage cost caused by replication. To maximize revenue, it is critical for cloud providers to maximize data availability, and hence minimize SLA (Service Level Agreement) violations, while minimizing the cost caused by replication.
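To make the point about constant replication degrees concrete, the following is a minimal sketch and not the dissertation's model. It assumes independent replica failures with a single hypothetical failure probability `p_fail`, so that an object with `r` replicas is available with probability `1 - p_fail**r`, and it ignores correlated failures and consistency maintenance cost entirely. The function names, the popularity values, and the replica budget are all illustrative assumptions.

```python
import heapq


def object_availability(p_fail, replicas):
    """Probability that at least one of `replicas` independent copies survives."""
    return 1.0 - p_fail ** replicas


def expected_availability(popularity, replicas, p_fail):
    """Popularity-weighted average availability over all objects."""
    total = sum(popularity)
    weighted = sum(
        w * object_availability(p_fail, r) for w, r in zip(popularity, replicas)
    )
    return weighted / total


def greedy_replication(popularity, budget, p_fail):
    """Give each object one replica, then hand out the remaining budget one
    replica at a time to the object with the largest marginal gain."""
    n = len(popularity)
    if budget < n:
        raise ValueError("budget must allow at least one replica per object")
    replicas = [1] * n

    def gain(i):
        # Gain from going r -> r + 1 replicas: w * (p^r - p^(r+1)).
        return popularity[i] * (p_fail ** replicas[i]) * (1.0 - p_fail)

    # Max-heap via negated gains.
    heap = [(-gain(i), i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(budget - n):
        _, i = heapq.heappop(heap)
        replicas[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return replicas


# Hypothetical workload: three objects with skewed popularity.
popularity = [70, 25, 5]
p_fail = 0.1
budget = 9  # total replicas the storage budget allows

uniform = [budget // len(popularity)] * len(popularity)
skewed = greedy_replication(popularity, budget, p_fail)

uniform_avail = expected_availability(popularity, uniform, p_fail)
skewed_avail = expected_availability(popularity, skewed, p_fail)
print("uniform:", uniform, uniform_avail)
print("popularity-aware:", skewed, skewed_avail)
```

Under these assumptions, the popularity-aware allocation spends the same replica budget but shifts copies toward the objects that are requested most often, which is the intuition behind customizing the replication degree per object rather than fixing it.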
In this dissertation, we build a nonlinear programming model to maximize data availability in both types of failures and minimize the cost caused by replication. Based on the model’s solution for the replication degree of each data object, we propose a low-cost multi-failure resilient replication scheme (MRR). MRR can effectively handle both correlated and
