A Workflow-Aware Storage System: An Opportunity Study

This paper evaluates the potential gains a workflow-aware storage system can bring. Two observations lead us to believe that such a storage system is crucial to efficiently supporting workflow-based applications. First, workflows generate irregular and application-dependent data access patterns. These patterns leave existing storage systems unable to harness all optimization opportunities, because doing so often requires conflicting optimization options or even conflicting design decisions at the level of the storage system. Second, when scheduling, workflow runtime engines make suboptimal decisions because they lack detailed data location information. This paper discusses the feasibility, and evaluates the potential performance benefits, of building a workflow-aware storage system that supports per-file access optimizations and exposes data location. To this end, it presents approaches to determine application-specific data access patterns and experimentally evaluates the performance gains of a workflow-aware storage approach. Our evaluation using synthetic benchmarks shows that a workflow-aware storage system can bring significant performance gains: up to 7× compared to MosaStore, a distributed storage system, and up to 16× compared to a central, well-provisioned NFS server.
