On Performance Resilient Scheduling for Scientific Workflows in HPC Systems with Constrained Storage Resources

Although storage capacity is increasing rapidly, dataset sizes are growing as well, especially for HPC workflows that perform parameter sweep studies. Consequently, deadlock caused by storage competition between concurrent workflow instances is still a major practical concern, and storage management remains important for high-performance and high-throughput computing. In practice, there are various ways to address this issue, ranging from admission control to deadlock resolution. Although simple, admission control is conservative and uses storage space inefficiently. Therefore, in this paper, we study the performance of the deadlock resolution approach by proposing a resource allocation algorithm whose performance is resilient across workflows with different characteristics. The algorithm builds on our previous result, called DDS, which takes advantage of the dataflow information of the workflow to resolve deadlock based on the detection-and-recovery principle. We improve DDS so that it not only resolves deadlock but also overcomes the performance anomaly, a problem not investigated in our previous studies. We call the improved algorithm the performance-resilient algorithm, denoted DDS+. The studies in this paper can be viewed as follow-up research on DDS and show the performance behavior of the improved algorithm under various conditions. The results are therefore useful for adapting DDS+ to real workflows with different characteristics while keeping performance stable.
