Stochastic-Model-Driven Adaptation and Recovery in Distributed Systems
