Application Cluster Service Scheme for Near-Zero-Downtime Services

Applications in a distributed computer system are expected to deliver continuous service, 24 hours a day, 7 days a week. However, computer failures caused by exhaustion of operating system resources, data corruption, accumulation of numerical errors, and similar faults may interrupt services and cause significant losses. This work therefore proposes an application cluster service (APCS) scheme. APCS provides both a failover scheme and a state recovery scheme for failure management. The failover scheme automatically activates a backup application to replace a failed application whenever the latter is sick or down. The state recovery scheme provides an inheritable design pattern for applications that require state recovery: an application only needs to inherit and implement this pattern to back up and recover its state. In addition, this study develops a performance evaluator (PEV) that detects performance degradation and predicts time to failure. Using these detection and prediction capabilities, APCS can carry out the failover process before a node breaks down. Applying APCS together with PEV thus enables a distributed computer system to provide services with near-zero downtime.
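To make the two schemes concrete, the following is a minimal Python sketch of how an inheritable state recovery pattern and a failover step could fit together. It is not the APCS implementation: the names `Recoverable`, `StateStore`, `OrderCounter`, and `failover`, the in-memory snapshot store, and the boolean `healthy` flag are all assumptions introduced for illustration only.

```python
from abc import ABC, abstractmethod
import copy


class Recoverable(ABC):
    """Hypothetical base class standing in for the inheritable pattern."""

    @abstractmethod
    def export_state(self) -> dict:
        """Return the state that must survive a failure."""

    @abstractmethod
    def import_state(self, state: dict) -> None:
        """Restore the application from a previously exported state."""


class StateStore:
    """Assumed in-memory snapshot store; a real system would persist this."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict] = {}

    def backup(self, name: str, app: Recoverable) -> None:
        # Deep-copy so later changes in the application do not alter the snapshot.
        self._snapshots[name] = copy.deepcopy(app.export_state())

    def recover(self, name: str, app: Recoverable) -> None:
        if name in self._snapshots:
            app.import_state(copy.deepcopy(self._snapshots[name]))


class OrderCounter(Recoverable):
    """Example application: it only inherits and implements the two hooks."""

    def __init__(self) -> None:
        self.processed = 0
        self.healthy = True  # assumed health flag set by some monitor

    def handle(self) -> None:
        self.processed += 1

    def export_state(self) -> dict:
        return {"processed": self.processed}

    def import_state(self, state: dict) -> None:
        self.processed = state["processed"]


def failover(primary, backup, store: StateStore, name: str):
    """Return the application that should serve requests.

    If the primary is unhealthy, the backup first recovers the last
    snapshot and then takes over.
    """
    if primary.healthy:
        return primary
    store.recover(name, backup)
    return backup


# Usage: the primary handles three requests, backing up after each one.
store = StateStore()
primary, standby = OrderCounter(), OrderCounter()
for _ in range(3):
    primary.handle()
    store.backup("orders", primary)

primary.healthy = False  # simulate a detected failure
active = failover(primary, standby, store, "orders")
active.handle()  # the standby continues from the recovered count
```

In this sketch the application code never touches the snapshot store directly. It only implements `export_state` and `import_state`, which is the property the abstract attributes to the inheritable pattern. A component that predicts failure in advance, as the PEV is described as doing, could clear the `healthy` flag before the node actually breaks down.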
