Fault tolerant MapReduce-MPI for HPC clusters
Pavan Balaji | Xiaobo Zhou | Yanfei Guo | Wesley Bland