Support for data-intensive, variable-granularity grid applications via distributed file system virtualization - a case study of light scattering spectroscopy

A key challenge faced by large-scale, distributed applications in grid environments is efficient, seamless data management. In particular, for applications that can benefit from access to data at variable granularities, data management can impose an additional programming burden on the application developer. This work presents a case for the use of virtualized distributed file systems as a basis for data management for data-intensive, variable-granularity applications. The approach leverages the on-demand transfer mechanisms of existing, de facto network file system clients and servers, which support transfers of partial data sets in an application-transparent fashion, and complements them with user-level performance and functionality enhancements such as caching and encrypted communication channels. The paper uses a nascent application from the medical imaging field, light scattering spectroscopy (LSS), as a motivation for the approach and as a basis for evaluating its performance. Results from performance experiments that consider the 16-processor parallel execution of LSS analysis and database generation programs show that, in the presence of data locality, a virtualized wide-area distributed file system set up and configured by grid middleware can achieve performance close to that of a local disk (13% overhead or less) and superior to that of nonvirtualized distributed file systems (up to 680% speedup).
