Trua: Efficient Task Replication for Flexible User-defined Availability in Scientific Grids

Failure is inevitable in scientific computing. As scientific applications and facilities have grown in scale over the last decades, finding the root cause of a failure has become very complex and at times nearly impossible. Scientific computing customers differ both in their availability demands and in their willingness to pay for availability. In contrast to existing solutions that try to provide ever higher availability in scientific grids, we propose a model called Task Replication for User-defined Availability (Trua). Trua provides flexible, user-defined availability in scientific grids, allowing customers to express their desired availability to computational providers. Trua differs from existing task replication approaches in two ways. First, it relies on historical failure information collected from the virtual layer of the scientific grids. The reliability model for these failures can be represented with a bimodal Johnson distribution, which differs from the distributions used in existing work. Second, it adopts an anomaly detector to filter out anomalous failures, and it adopts novel selection algorithms to mitigate the effects of temporal and spatial correlations among failures without knowing their root cause. We apply Trua to real-world traces collected from the Open Science Grid (OSG). Our results show that Trua can successfully meet user-defined availability demands.
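To make the idea of replicating a task until a user-defined availability target is met more concrete, the following is a minimal sketch and not Trua's actual algorithm. It assumes that every replica succeeds independently with the same probability, which is a simplification: the abstract states that real failures are temporally and spatially correlated and follow a bimodal Johnson distribution. The function name `replicas_needed` and its parameters are hypothetical and chosen only for illustration.

```python
def replicas_needed(target_availability: float,
                    per_replica_success: float,
                    max_replicas: int = 1000) -> int:
    """Return the smallest replica count n such that
    1 - (1 - p)^n >= target_availability.

    Assumption (not from the paper): replicas fail independently,
    each succeeding with probability p = per_replica_success.
    """
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be in (0, 1)")
    if not 0.0 < per_replica_success <= 1.0:
        raise ValueError("per_replica_success must be in (0, 1]")

    all_fail = 1.0  # probability that every replica launched so far fails
    for n in range(1, max_replicas + 1):
        all_fail *= 1.0 - per_replica_success
        if 1.0 - all_fail >= target_availability:
            return n
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # A user asking for 99% availability on resources where a single
    # replica completes with probability 0.7.
    print(replicas_needed(0.99, 0.7))
```

Under this independence assumption, a higher requested availability or a less reliable resource pool translates directly into more replicas, which is the cost a customer trades against availability. Correlated failures would make this estimate optimistic.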
