Predicting Node Failure in High Performance Computing Systems from Failure and Usage Logs
[1] W. Marsden. I and J , 2012 .
[2] Yoav Freund,et al. The Alternating Decision Tree Learning Algorithm , 1999, ICML.
[3] David W. Hosmer,et al. Applied Logistic Regression , 1991 .
[4] Zhiling Lan,et al. Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid , 2007 .
[5] Darrell D. E. Long,et al. A longitudinal survey of Internet host reliability , 1995, Proceedings. 14th Symposium on Reliable Distributed Systems.
[6] Mark S. Squillante,et al. Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.
[7] Gregory F. Cooper,et al. A Bayesian Method for the Induction of Probabilistic Networks from Data , 1992 .
[8] Richard Wolski,et al. Modeling Machine Availability in Enterprise and Wide-Area Distributed Computing Environments , 2005, Euro-Par.
[9] Leo Breiman,et al. Random Forests , 2001, Machine Learning.
[10] J. Ross Quinlan,et al. C4.5: Programs for Machine Learning , 1992 .
[11] Pat Langley,et al. Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.
[12] Richard P. Martin,et al. Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.
[13] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[14] Rajeev Thakur,et al. A Fault Diagnosis and Prognosis Service for TeraGrid Clusters , 2007 .
[15] Nithin Nakka,et al. Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems , 2009, HPCS.
[16] Srinivasan Seshan,et al. Subtleties in tolerating correlated failures , 2006 .
[17] Daniel P. Siewiorek,et al. Workload, Performance, and Reliability of Digital Computing Systems , 1980 .
[18] Zhiling Lan,et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.
[19] Ron Kohavi,et al. The Power of Decision Tables , 1995, ECML.
[20] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[21] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput.
[22] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[23] S. Cessie,et al. Ridge Estimators in Logistic Regression , 1992 .
[24] Ravishankar K. Iyer,et al. Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.
[25] Ravishankar K. Iyer,et al. Measurement and modeling of computer reliability as affected by system activity , 1986, TOCS.
[26] Ravishankar K. Iyer,et al. Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.