Long-term availability prediction for groups of volunteer resources

Volunteer computing uses the free resources of Internet and intranet environments for large-scale computation and storage. Currently, 70 applications use over 12 PetaFLOPS of computing power from such platforms. However, these platforms are currently limited to embarrassingly parallel applications. To broaden the set of applications that can leverage volunteer computing, we focus on the problem of predicting whether a group of resources will be continuously available for a relatively long time period. Such collective availability is important for enabling parallel applications and workflows on volunteer computing platforms, but ensuring it is challenging because volunteer resources are inherently volatile and autonomous. We evaluate our prediction methods using real availability traces gathered from hundreds of thousands of hosts in the SETI@home volunteer computing project. We show that our prediction methods can reliably guarantee the availability of collections of volunteer resources, and that this is particularly useful for service deployments over volunteer computing environments.
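To make "collective availability" concrete, here is a minimal sketch that assumes availability traces are discretized into fixed time slots, with 1 meaning available and 0 meaning unavailable (a hypothetical trace format chosen for illustration, not necessarily the format of the SETI@home traces). The first helper checks whether every host in a group was available in every slot of a window. The second shows why long-term group availability is hard to guarantee under the simplifying assumption that hosts behave independently: if each of n hosts stays up with probability p, the whole group does so with probability p^n, which shrinks quickly as n grows. Both function names are hypothetical, and this is an illustration of the problem, not the prediction method evaluated in the paper.

```python
from typing import Sequence


def group_continuously_available(
    traces: Sequence[Sequence[int]], start: int, length: int
) -> bool:
    """Return True if every host's trace is 1 (available) in every slot
    of the window [start, start + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    for trace in traces:
        window = trace[start:start + length]
        # A window that runs past the end of the trace is unobserved,
        # so it cannot count as continuously available.
        if len(window) < length or not all(slot == 1 for slot in window):
            return False
    return True


def independent_group_probability(p: float, n: int) -> float:
    """Probability that n hosts all stay available over the window,
    ASSUMING each does so independently with probability p.
    Real volunteer hosts need not be independent."""
    return p ** n


# Hypothetical traces for three hosts over six time slots.
traces = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
]
print(group_continuously_available(traces, 1, 3))  # True
print(group_continuously_available(traces, 0, 3))  # False: third host is down in slot 0
print(round(independent_group_probability(0.95, 20), 3))  # 0.358
```

Even with 95% per-host availability, a group of 20 independent hosts stays fully available only about 36% of the time under this assumption, which is why selecting groups by prediction, rather than at random, matters.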