Data Replication in Data Intensive Scientific Applications with Performance Guarantee

Data replication is widely adopted in data-intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data-intensive applications, has been proven to be NP-hard and even non-approximable, which makes it difficult to solve. Meanwhile, most previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but can also be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm that reduces the total data file access delay by at least half of the reduction achieved by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and to other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and more adaptive to dynamic changes in file access patterns.
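The abstract does not spell out the centralized algorithm, so the following is only a minimal sketch of one plausible approach: greedy placement by marginal delay reduction. Everything in it is an assumption made for illustration and not taken from the paper: the names `total_delay` and `greedy_replicate`, the use of a distance matrix as the access delay, one origin copy per file that does not count against storage, and per-node capacity measured in number of files. The sketch does not by itself establish the factor-of-one-half guarantee stated above.

```python
from itertools import product


def total_delay(dist, demand, holders):
    """Sum over (node, file) of access frequency times distance to the nearest copy."""
    return sum(
        freq * min(dist[node][h] for h in holders[f])
        for node, row in enumerate(demand)
        for f, freq in enumerate(row)
    )


def greedy_replicate(dist, demand, origin, capacity):
    """Hypothetical greedy placement (an assumption, not the paper's algorithm).

    dist[i][j]   : delay between nodes i and j
    demand[i][f] : how often node i accesses file f
    origin[f]    : node holding the original copy of file f
    capacity[i]  : number of replicas node i can store
    Returns the chosen (node, file) placements and the final total delay.
    """
    holders = [{src} for src in origin]   # nodes currently holding each file
    free = list(capacity)                 # remaining replica slots per node
    current = total_delay(dist, demand, holders)
    placements = []
    while True:
        best_gain, best = 0, None
        for node, f in product(range(len(dist)), range(len(origin))):
            if free[node] == 0 or node in holders[f]:
                continue
            # Tentatively place the replica and measure the delay reduction.
            holders[f].add(node)
            gain = current - total_delay(dist, demand, holders)
            holders[f].remove(node)
            if gain > best_gain:
                best_gain, best = gain, (node, f)
        if best is None:                  # no placement reduces delay any further
            break
        node, f = best
        holders[f].add(node)
        free[node] -= 1
        current -= best_gain
        placements.append(best)
    return placements, current
```

Each round re-evaluates every feasible (node, file) pair, so the running time is polynomial in the number of nodes and files, although it is far from optimized. On a four-node line network with both files originating at node 0, one replica slot at each of nodes 2 and 3, and demand concentrated at those two nodes, the sketch first places the heavily requested file at node 3 and then the other file at node 2.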
