Data scheduling for large scale distributed applications

Large-scale distributed applications studied by large research communities pose challenging new problems in widely distributed environments. In particular, scientific experiments that use geographically separated, heterogeneous resources require transparent access to distributed data and the analysis of huge collections of information. We focus on data-intensive distributed computing and describe a data scheduling approach for managing large-scale scientific and commercial applications. We identify parameters that affect data transfer and analyze different scenarios for possible use cases of data placement tasks in order to discover the key attributes for performance optimization. We plan to define the crucial factors in data placement in widely distributed systems and to develop a strategy that schedules data transfers according to the characteristics of dynamically changing distributed environments.
