Self-awareness of Cloud Applications

Cloud applications today deliver an increasingly large portion of information and communications technology (ICT) services. To address the scale, growth, and reliability of cloud applications, self-aware management and scheduling are becoming commonplace. How are they used in practice? In this chapter, we propose a conceptual framework for analyzing state-of-the-art self-awareness approaches used in the context of cloud applications. We map important applications from popular and emerging application domains to this conceptual framework and compare the practical characteristics, benefits, and drawbacks of self-awareness approaches. Finally, we propose a road map for addressing the open challenges in self-aware cloud and datacenter applications.
