Discovering Statistical Models of Availability in Large Distributed Systems: An Empirical Study of SETI@home

In the age of cloud, Grid, P2P, and volunteer distributed computing, large-scale systems with tens of thousands of unreliable hosts are increasingly common. Invariably, these systems are composed of heterogeneous hosts whose individual availability often exhibits different statistical properties (for example, stationary versus nonstationary behavior) and fits different models (for example, exponential, Weibull, or Pareto probability distributions). In this paper, we describe an effective method for discovering subsets of hosts whose availability has similar statistical properties and can be modeled with similar probability distributions. We apply this method to about 230,000 host availability traces obtained from a real Internet-distributed system, namely SETI@home. We find that about 21 percent of hosts exhibit availability that is a truly random process, and that these hosts can often be modeled accurately with a few distinct distributions from different families. We show that our models are useful and accurate in the context of a scheduling problem that deals with resource brokering. We believe that these methods and models are critical for the design of stochastic scheduling algorithms across large systems where host availability is uncertain.
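To make the idea of matching availability data to a distribution family concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes that a host's availability is summarized as a list of positive interval lengths, fits only the exponential and Pareto families by maximum likelihood, and picks the family with the smaller Kolmogorov-Smirnov distance. The function names, the choice of the Kolmogorov-Smirnov distance, and the omission of the Weibull family, the randomness tests, and the host-clustering step are all assumptions made for illustration.

```python
import math
import random


def fit_exponential(xs):
    """Maximum-likelihood rate of an exponential: 1 / sample mean."""
    return len(xs) / sum(xs)


def fit_pareto(xs):
    """Maximum-likelihood Pareto fit: scale = sample minimum,
    shape = n / sum(log(x / scale)). Assumes the values are not all equal."""
    scale = min(xs)
    log_sum = sum(math.log(x / scale) for x in xs)
    return scale, len(xs) / log_sum


def ks_distance(xs, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF of xs and cdf."""
    ordered = sorted(xs)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # Compare the model CDF with the empirical CDF just before and at x.
        distance = max(distance, abs(f - i / n), abs((i + 1) / n - f))
    return distance


def best_model(xs):
    """Return the name of the better-fitting family and both distances."""
    rate = fit_exponential(xs)
    scale, shape = fit_pareto(xs)

    def exponential_cdf(x):
        return 1.0 - math.exp(-rate * x)

    def pareto_cdf(x):
        return 1.0 - (scale / x) ** shape if x >= scale else 0.0

    distances = {
        "exponential": ks_distance(xs, exponential_cdf),
        "pareto": ks_distance(xs, pareto_cdf),
    }
    return min(distances, key=distances.get), distances


if __name__ == "__main__":
    # Synthetic interval lengths (hypothetical data, not SETI@home traces).
    rng = random.Random(42)
    intervals = [rng.expovariate(0.5) for _ in range(1000)]
    name, distances = best_model(intervals)
    print(name, distances)
```

A sketch like this only ranks candidate families for one sample. Grouping hosts with similar fits and checking that each trace behaves as a random process would be separate steps.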
