Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing
[1] S. Venkatesan,et al. Crash recovery with little overhead , 1991, Proceedings. 11th International Conference on Distributed Computing Systems.
[2] Fred B. Schneider. What good are models and what models are good , 1993 .
[3] Borivoje Nikolic,et al. Opportunities for Fine-Grained Adaptive Voltage Scaling to Improve System-Level Energy Efficiency , 2015 .
[4] Ricardo Bianchini,et al. Conserving disk energy in network servers , 2003, ICS '03.
[5] André Schiper,et al. Replication for send-deterministic MPI HPC applications , 2013, FTXS '13.
[6] Jong Kim,et al. Probabilistic checkpointing , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.
[7] Leslie Lamport,et al. Distributed snapshots: determining global states of distributed systems , 1985, TOCS.
[8] Babak Falsafi,et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.
[9] Daniel Marques,et al. Compiler-enhanced incremental checkpointing for OpenMP applications , 2009, IPDPS.
[10] Christian Engelmann,et al. Redundant Execution of HPC Applications with MR-MPI , 2011 .
[11] Louise E. Moser,et al. Fault Tolerance Middleware for Cloud Computing , 2010, 2010 IEEE 3rd International Conference on Cloud Computing.
[12] Stijn Eyerman,et al. Fine-grained DVFS using on-chip regulators , 2011, TACO.
[13] Brian Randell,et al. System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.
[14] Bryan Mills,et al. Power-aware resilience for exascale computing , 2014 .
[15] Randy H. Katz,et al. A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.
[16] Joel F. Bartlett,et al. A NonStop kernel , 1981, SOSP.
[17] Christian Engelmann,et al. Proactive Fault Tolerance Using Preemptive Migration , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.
[18] Kai Li,et al. Faster checkpointing with N+1 parity , 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.
[19] Sara Bouchenak,et al. Benchmarking Dependability of MapReduce Systems , 2012, 2012 IEEE 31st Symposium on Reliable Distributed Systems.
[20] Luís Moura Silva,et al. Using two-level stable storage for efficient checkpointing , 1998, IEE Proc. Softw..
[21] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[22] Henri Casanova,et al. Combining Process Replication and Checkpointing for Resilience on Exascale Systems , 2012 .
[23] Roberto Baldoni,et al. Total Order Communications: A Practical Analysis , 2005, EDCC.
[24] Laxmikant V. Kalé,et al. Energy profile of rollback-recovery strategies in high performance computing , 2014, Parallel Comput..
[25] James H. Laros,et al. Redundant computing for exascale systems , 2010 .
[26] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] Gernot Heiser,et al. Dynamic voltage and frequency scaling: the laws of diminishing returns , 2010 .
[28] Vincent K. N. Lau,et al. Automatic Performance Setting for Dynamic Voltage Scaling , 2002, Wirel. Networks.
[29] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[30] Felix C. Gärtner,et al. Fundamentals of fault-tolerant distributed computing in asynchronous environments , 1999, CSUR.
[31] Laxmikant V. Kalé,et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[32] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[33] G. Sohi,et al. A static power model for architects , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.
[34] Dakai Zhu,et al. Global Reliability-Aware Power Management for Multiprocessor Real-Time Systems , 2010, 2010 IEEE 16th International Conference on Embedded and Real-Time Computing Systems and Applications.
[35] Laxmikant V. Kalé,et al. Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[36] Paulo Veríssimo,et al. Resilient state machine replication , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).
[37] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[38] Hong Zhu,et al. A survey of practical algorithms for suffix tree construction in external memory , 2010 .
[39] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[40] Fred B. Schneider,et al. Implementing fault-tolerant services using the state machine approach: a tutorial , 1990, CSUR.
[41] George Bosilca,et al. High Performance RDMA Protocols in HPC , 2006, PVM/MPI.
[42] Franck Cappello,et al. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.
[43] Israel Koren,et al. Fault-Tolerant Systems , 2007 .
[44] Luiz André Barroso,et al. The Price of Performance , 2005, ACM Queue.
[45] Flaviu Cristian,et al. Understanding fault-tolerant distributed systems , 1991, CACM.
[46] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[47] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[48] Rami G. Melhem,et al. Maximizing rewards for real-time applications with energy constraints , 2003, TECS.
[49] Dirk Grunwald,et al. Massive Arrays of Idle Disks For Storage Archives , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[50] Wei-Tek Tsai,et al. Service Replication Strategies with MapReduce in Clouds , 2011, 2011 Tenth International Symposium on Autonomous Decentralized Systems.
[51] Ricardo Bianchini,et al. Exploiting redundancy to conserve energy in storage systems , 2006, SIGMETRICS '06/Performance '06.
[52] Jason Duell,et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .
[53] Taieb Znati,et al. Shadow Replication: An Energy-Aware, Fault-Tolerant Computational Model for Green Cloud Computing , 2014 .
[54] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[55] Heon Young Yeom,et al. An efficient algorithm for causal message logging , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).
[56] Rami G. Melhem,et al. Adaptive and Power-Aware Resilience for Extreme-Scale Computing , 2016, 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld).
[57] Erik Seligman,et al. Application Level Fault Tolerance in Heterogenous Networks of Workstations , 1997, J. Parallel Distributed Comput..
[58] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..
[59] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[60] Lorenzo Alvisi,et al. An analysis of communication induced checkpointing , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).
[61] Laurent Broto,et al. Approaches to cloud computing fault tolerance , 2012, 2012 International Conference on Computer, Information and Telecommunication Systems (CITS).
[62] Thomas Hérault,et al. Post-failure recovery of MPI communication capability , 2013, Int. J. High Perform. Comput. Appl..
[63] Shubhendu S. Mukherjee,et al. Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[64] David Fiala. Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[65] James H. Laros,et al. Does partial replication pay off? , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[66] E. N. Elnozahy. How safe is probabilistic checkpointing? , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[67] Xiaohui Gu,et al. Understanding Real World Data Corruptions in Cloud Systems , 2015, 2015 IEEE International Conference on Cloud Engineering.
[68] Sean W. Smith,et al. Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback , 1995, Proceedings 15th Symposium on Reliable Distributed Systems.
[69] Franck Cappello,et al. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[70] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[71] Qin Zheng. Improving MapReduce fault tolerance in the cloud , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[72] Thomas Hérault,et al. Failure Detection and Propagation in HPC systems , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[73] Ten-Hwang Lai,et al. On Distributed Snapshots , 1987, Inf. Process. Lett..
[74] Yuval Tamir,et al. Error Recovery in Multicomputers Using Global Checkpoints , 1984 .
[75] Dakai Zhu,et al. Reliability-aware Dynamic Voltage Scaling for energy-constrained real-time embedded systems , 2008, 2008 IEEE International Conference on Computer Design.
[76] Ian Karlin,et al. LULESH 2.0 Updates and Changes , 2013 .
[77] V. Rajaraman,et al. A survey of checkpointing algorithms for parallel and distributed computers , 2000 .
[78] Kang G. Shin,et al. Real-time dynamic voltage scaling for low-power embedded operating systems , 2001, SOSP.
[79] Yuanyuan Zhou,et al. DMA-aware memory energy management , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006.
[80] Meeta Sharma Gupta,et al. System level analysis of fast, per-core DVFS using on-chip switching regulators , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[81] Lorenzo Alvisi,et al. Trade-offs in implementing causal message logging protocols , 1996, PODC '96.
[82] Minna Palmroth,et al. Topology Aware Process Mapping , 2012, PARA.
[83] Franck Cappello,et al. Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..
[84] Rong Ge,et al. CPU MISER: A Performance-Directed, Run-Time System for Power-Aware Clusters , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[85] Leslie Lamport,et al. The part-time parliament , 1998, TOCS.
[86] Laxmikant V. Kalé,et al. Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems , 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.
[87] E. N. Elnozahy,et al. Energy Conservation Policies for Web Servers , 2003, USENIX Symposium on Internet Technologies and Systems.
[88] Rami G. Melhem,et al. Shadow Computing: An energy-aware fault tolerant computing model , 2014, 2014 International Conference on Computing, Networking and Communications (ICNC).
[89] Chris Fallin,et al. Memory power management via dynamic voltage/frequency scaling , 2011, ICAC '11.
[90] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[91] Robert C. Aitken,et al. Impact of Technology and Voltage Scaling on the Soft Error Susceptibility in Nanoscale CMOS , 2008, 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems.
[92] Jeffrey F. Naughton,et al. Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994, IEEE Trans. Parallel Distributed Syst..
[93] David Blaauw,et al. Theoretical and practical limits of dynamic voltage scaling , 2004, Proceedings. 41st Design Automation Conference, 2004.
[94] Kesheng Wu,et al. Scientific Discovery at the Exascale , 2011 .
[95] S. Huang,et al. Energy-Efficient Cluster Computing via Accurate Workload Characterization , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
[96] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[97] Leslie Lamport,et al. Time, clocks, and the ordering of events in a distributed system , 1978, CACM.
[98] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[99] Rami G. Melhem,et al. Shadows on the Cloud: An Energy-aware, Profit Maximizing Resilience Framework for Cloud Computing , 2014, CLOSER.
[100] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[101] Laxmikant V. Kalé,et al. A ‘cool’ way of improving the reliability of HPC machines , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[102] Ananta Tiwari,et al. Green Queue: Customized Large-Scale Clock Frequency Scaling , 2012, 2012 Second International Conference on Cloud and Green Computing.
[103] Sparsh Mittal,et al. Power Management Techniques for Data Centers: A Survey , 2014, ArXiv.
[104] Margaret H. Wright,et al. The opportunities and challenges of exascale computing , 2010 .
[105] Franck Cappello,et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..
[106] David K. Lowenthal,et al. Just In Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[107] C. V. Ramamoorthy,et al. Rollback and Recovery Strategies for Computer Programs , 1972, IEEE Transactions on Computers.
[108] David B. Skillicorn,et al. Questions and Answers about BSP , 1997, Sci. Program..
[109] Laxmikant V. Kalé,et al. A scalable double in-memory checkpoint and restart scheme towards exascale , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[110] Vanish Talwar,et al. No "power" struggles: coordinated multi-level power management for the data center , 2008, ASPLOS.
[111] D.K. Lowenthal,et al. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[112] Johan Vounckx,et al. Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback , 1993 .
[113] Achour Mostéfaoui,et al. Preventing useless checkpoints in distributed computations , 1997, Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems.
[114] Robert E. Strom,et al. Optimistic recovery in distributed systems , 1985, TOCS.
[115] Dakai Zhu,et al. Generalized reliability-oriented energy management for real-time embedded applications , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).
[116] Bran Selic,et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems , 2013, The Journal of Supercomputing.
[117] William Gropp,et al. Towards a More Complete Understanding of SDC Propagation , 2017, HPDC.
[118] Yi-Min Wang,et al. Why optimistic message logging has not been used in telecommunications systems , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[119] Bo Fang,et al. LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures , 2017, HPDC.
[120] Rajeev Thakur,et al. Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..
[121] Calton Pu,et al. Impact of DVFS on n-tier application performance , 2013, TRIOS@SOSP.
[122] David B. Johnson,et al. Sender-Based Message Logging , 1987 .
[123] Michael Franz,et al. Power reduction techniques for microprocessor systems , 2005, CSUR.
[124] Pankaj Jalote,et al. Fault tolerance in distributed systems , 1994 .
[125] Seetharami R. Seelam,et al. Modeling the Impact of Checkpoints on Next-Generation Systems , 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[126] A. Prasad Sistla,et al. Efficient distributed recovery using message logging , 1989, PODC '89.
[127] Franklin T. Luk,et al. An Analysis of Algorithm-Based Fault Tolerance Techniques , 1988, J. Parallel Distributed Comput..
[128] Susanne Albers,et al. Energy-efficient algorithms , 2010, Commun. ACM.
[129] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[130] Satoshi Matsuoka,et al. Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system , 2013, FTXS '13.
[131] Bronis R. de Supinski,et al. Adagio: making DVS practical for complex HPC applications , 2009, ICS.
[132] Laurent Lefèvre,et al. A survey on techniques for improving the energy efficiency of large-scale distributed systems , 2014, ACM Comput. Surv..
[133] Rami G. Melhem,et al. Energy Consumption of Resilience Mechanisms in Large Scale Systems , 2014, 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[134] Thomas Hérault,et al. Practical scalable consensus for pseudo-synchronous distributed systems , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.