File-access patterns of data-intensive workflow applications and their implications to distributed filesystems

This paper studies five real-world data-intensive workflow applications in the fields of natural language processing, astronomy image analysis, and web data analysis. Data-intensive workflows are becoming increasingly important applications for cluster and Grid environments. They pose new challenges to various components of workflow execution environments, including job dispatchers, schedulers, file systems, and file staging tools. The keys to achieving high performance are efficient data sharing among executing hosts and locality-aware scheduling that reduces the amount of data transfer. While much work has been done on scheduling workflows, many of these studies use synthetic or random workloads. As such, their impact on real workloads is largely unknown. Understanding the characteristics of real-world workflow applications is a necessary step toward advancing research in this area. To this end, we analyse real-world workflow applications, focusing on their file access patterns, and summarize the implications for schedulers and for file system and staging designs.
