Predicting and Mitigating Jobs Failures in Big Data Clusters
[1] Yuanyuan Zhou,et al. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics , 2008, ASPLOS.
[2] Navendu Jain,et al. Demystifying the dark side of the middle: a field study of middlebox failures in datacenters , 2013, Internet Measurement Conference.
[3] Andrea Rosà,et al. Quantifying the Brown Side of Priority Schedulers: Lessons from Big Clusters , 2014, PERV.
[4] Sheng Di,et al. Characterization and Comparison of Cloud versus Grid Workloads , 2012, 2012 IEEE International Conference on Cluster Computing.
[5] Franck Cappello,et al. Characterizing Cloud Applications on a Google Data Center , 2013, 2013 42nd International Conference on Parallel Processing.
[6] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[7] Andrew S. Tanenbaum,et al. Distributed systems: Principles and Paradigms , 2001 .
[8] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[9] Andrea Rosà,et al. Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[10] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.
[11] P. Bak,et al. Learning from mistakes , 1997, Neuroscience.
[12] Luiz André Barroso,et al. Web Search for a Planet: The Google Cluster Architecture , 2003, IEEE Micro.
[13] Yang Liu,et al. Be conservative: enhancing failure diagnosis with proactive logging , 2012, OSDI 2012.
[14] Kashi Venkatesh Vishwanath,et al. Characterizing cloud computing hardware reliability , 2010, SoCC '10.
[15] Christopher M. Bishop,et al. Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .
[16] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[17] Ilias Iliadis,et al. Effect of Latent Errors on the Reliability of Data Storage Systems , 2013, 2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems.
[18] Urs Hölzle,et al. Web Search for a Planet , 2003 .
[19] Rahul Potharaju,et al. When the network crumbles: an empirical study of cloud network failures and their impact on services , 2013, SoCC.
[20] Sangyeun Cho,et al. Characterizing Machines and Workloads on a Google Cluster , 2012, 2012 41st International Conference on Parallel Processing Workshops.
[21] Bianca Schroeder,et al. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? , 2007, FAST.
[22] Yuanyuan Zhou,et al. Efficient online validation with delta execution , 2009, ASPLOS.
[23] Laura L. Pullum,et al. Software Fault Tolerance Techniques and Implementation , 2001 .
[24] Yuanyuan Zhou,et al. Sweeper: a lightweight end-to-end system for defending against fast worms , 2007, EuroSys '07.
[25] Nasser M. Nasrabadi,et al. Pattern Recognition and Machine Learning , 2006, Technometrics.
[26] Randy H. Katz,et al. Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.
[27] Evangelos Eleftheriou,et al. Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems , 2008, SIGMETRICS '08.
[28] Robert Birke,et al. Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[29] Van-Anh Truong,et al. Availability in Globally Distributed Storage Systems , 2010, OSDI.
[30] Cristina L. Abad,et al. Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters , 2013, SoCC.
[31] Ilias Iliadis. Reliability modeling of RAID storage systems with latent errors , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.
[32] Andrea Rosà,et al. Demystifying Casualties of Evictions in Big Data Priority Scheduling , 2015, PERV.