Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures
[1] Bernd Fritzke,et al. A Growing Neural Gas Network Learns Topologies , 1994, NIPS.
[2] Grenville J. Armitage,et al. A survey of techniques for internet traffic classification using machine learning , 2008, IEEE Communications Surveys & Tutorials.
[3] Michael I. Jordan,et al. Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res.
[4] Christian Engelmann,et al. Super-Scalable Algorithms for Computing on 100,000 Processors , 2005, International Conference on Computational Science.
[5] Zizhong Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[6] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[7] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov.
[8] Gaël Varoquaux,et al. Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res.
[9] Ron Brightwell,et al. Cooperative Application/OS DRAM Fault Recovery , 2011, Euro-Par Workshops.
[10] Ke Wang,et al. Fileprints: identifying file types by n-gram analysis , 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.
[11] Vern Paxson,et al. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection , 2010, 2010 IEEE Symposium on Security and Privacy.
[12] John Shalf,et al. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.
[13] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.
[14] Steve Plimpton,et al. Fast parallel algorithms for short-range molecular dynamics , 1993.
[15] Yasushi Saito,et al. Optimistic replication , 2005, CSUR.
[16] Thomas Hérault,et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[17] Franck Cappello,et al. Coordinated checkpoint versus message log for fault tolerant MPI , 2004, 2003 Proceedings IEEE International Conference on Cluster Computing.
[18] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[19] J. M. McGlaun,et al. CTH: A software family for multi-dimensional shock physics analysis , 1995.
[20] Carsten Willems,et al. Automatic analysis of malware behavior using machine learning , 2011, J. Comput. Secur.
[21] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.
[22] R. Kondor,et al. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons , 2009, Physical review letters.
[23] Zizhong Chen,et al. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[24] Ian H. Witten,et al. The WEKA data mining software: an update , 2009, SKDD.
[25] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[26] David R. Karger,et al. Tackling the Poor Assumptions of Naive Bayes Text Classifiers , 2003, ICML.