Using Regression Techniques to Predict Large Data Transfers

The recent proliferation of Data Grids and the increasingly common practice of using resources as distributed data stores provide a convenient environment for communities of researchers to share, replicate, and manage access to copies of large datasets. This has led to the question of which replica can be accessed most efficiently. Answering it requires accurate predictions of the end-to-end transfer time from each of the candidate replica locations. The answer can depend on many factors, including the physical characteristics of the resources and the load on the CPUs, networks, and storage devices that make up the end-to-end data path linking possible sources and sinks. Our approach combines end-to-end application throughput observations with measurements of network and disk load, and so captures both whole-system performance and variations in load patterns. Our predictions characterize the effect that load variations on several shared devices (network and disk) have on file transfer times. We develop a suite of univariate and multivariate predictors that can use multiple data sources, both to improve prediction accuracy and to cope with Data Grid variations (limited availability of data and the sporadic nature of transfers). We ran a large set of data transfer experiments using GridFTP and observed prediction errors within 15% for our testbed sites, which is quite promising for a pragmatic system.
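As a rough illustration of the two predictor families mentioned above, the following Python sketch shows one possible univariate predictor (a moving average over past transfer throughputs) and one possible regression predictor (an ordinary least-squares fit of transfer throughput against a single load measurement, such as a network probe bandwidth). The function names, the window size, the units (MB and MB/s), and the choice of a single explanatory variable are assumptions made for this example. It is not the predictor suite evaluated in the paper.

```python
from statistics import mean


def moving_average_predictor(throughputs, window=5):
    """Univariate sketch: predict the next throughput (MB/s) as the mean
    of the last `window` observed transfer throughputs."""
    if not throughputs:
        raise ValueError("need at least one past observation")
    return mean(throughputs[-window:])


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b * x.

    Here x is a load measurement (hypothetically, a network probe
    bandwidth) and y is the observed transfer throughput."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def regression_predictor(xs, ys, current_x):
    """Predict throughput from the current load measurement."""
    a, b = fit_linear(xs, ys)
    return a + b * current_x


def predict_transfer_time(file_size_mb, predicted_throughput):
    """Transfer time in seconds for a file of `file_size_mb` megabytes
    at `predicted_throughput` MB/s."""
    if predicted_throughput <= 0:
        raise ValueError("predicted throughput must be positive")
    return file_size_mb / predicted_throughput


def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return abs(predicted - measured) / abs(measured) * 100.0
```

To choose among replicas under these assumptions, one would compute `predict_transfer_time` for each candidate source and pick the smallest. `percent_error` is one simple way to compare a prediction with the transfer time actually measured.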
