Service Placement in a Shared Wide-Area Platform

Emerging federated computing environments offer attractive platforms to test and deploy global-scale distributed applications. When nodes in these platforms are timeshared among competing applications, available resources vary across nodes and over time. Thus, one open architectural question in such systems is how to map applications to available nodes--that is, how to discover and select resources. Using a six-month trace of PlanetLab resource utilization data and of resource demands from three long-running PlanetLab services, we quantitatively characterize resource availability and application usage behavior across nodes and over time, and investigate the potential to mitigate the application impact of resource variability through intelligent service placement and migration. We find that usage of CPU and network resources is heavy and highly variable. We argue that this variability calls for intelligently mapping applications to available nodes. Further, we find that node placement decisions can become ill-suited after about 30 minutes, suggesting that some applications can benefit from migration at that timescale, and that placement and migration decisions can be safely based on data collected at roughly that timescale. We find that inter-node latency is stable and is a good predictor of available bandwidth; this observation argues for collecting latency data at relatively coarse timescales and bandwidth data at even coarser timescales, using the former to predict the latter between measurements. 
Finally, we find that although the utilization of a particular resource on a particular node is a good predictor of that node's utilization of that resource in the near future, there do not exist correlations to support predicting one resource's availability based on availability of other resources on the same node at the same time, on availability of the same resource on other nodes at the same site, or on time-series forecasts that assume a daily or weekly regression to the mean.
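
As an illustration of how these findings might inform a placement policy, the following minimal Python sketch uses a node's most recent utilization sample as the prediction of its near-future utilization, and reconsiders placement on roughly the 30-minute timescale noted above. It is not the method evaluated in this work: the function names, the `min_gain` threshold, and the use of CPU utilization fractions as input are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed re-evaluation period, taken from the ~30-minute timescale in the abstract.
REEVALUATE_SECONDS = 30 * 60


def predict_next(history: List[float]) -> float:
    """Last-value predictor: assume the next sample equals the most recent one."""
    if not history:
        raise ValueError("no utilization samples available")
    return history[-1]


def choose_node(cpu_history: Dict[str, List[float]]) -> str:
    """Return the node with the lowest predicted CPU utilization.

    Ties are broken by node name so the result is deterministic.
    """
    if not cpu_history:
        raise ValueError("no candidate nodes")
    return min(cpu_history, key=lambda n: (predict_next(cpu_history[n]), n))


def should_migrate(current: str,
                   cpu_history: Dict[str, List[float]],
                   min_gain: float = 0.10) -> bool:
    """Migrate only if the best node is predicted to be better by more than
    min_gain (a hypothetical threshold meant to avoid needless moves)."""
    best = choose_node(cpu_history)
    gain = predict_next(cpu_history[current]) - predict_next(cpu_history[best])
    return gain > min_gain
```

A caller would invoke `should_migrate` once every `REEVALUATE_SECONDS` with fresh samples. The threshold reflects one possible design choice: since utilization is highly variable, moving for a marginal predicted gain may cost more than it saves.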
