Utility-Driven Proactive Management of Availability in Enterprise-Scale Information Flows

Enterprises rely critically on the timely and sustained delivery of information. To support this need, we augment information flow middleware with new functionality that provides high levels of availability to distributed applications while maximizing the utility end users derive from such information. Specifically, the paper presents utility-driven ‘proactive availability-management’ techniques that offer (1) information flows that dynamically self-determine their availability requirements based on high-level utility specifications, (2) flows that can trade recovery time for performance based on the ‘perceived’ stability of the underlying system and on failure predictions (early alarms) for it, and (3) methods, based on real-world case studies, to deal with both transient and non-transient failures. Utility-driven ‘proactive availability-management’ is integrated into information flow middleware and used with representative applications. Experiments reported in the paper demonstrate the middleware's ability to self-determine availability guarantees, to offer improved performance over a statically configured system, and to remain resilient to a wide range of faults.
