Making Work Queue cluster-friendly for data-intensive scientific applications

Researchers with large-scale data-intensive applications often wish to scale them up to run on multiple clusters, employing a middleware layer for resource management across clusters. However, at the very largest scales, such middleware is often “unfriendly” to individual clusters, which are usually designed to support communication within the cluster, not outside of it. To address this problem, we have modified the Work Queue master-worker application framework to support a hierarchical configuration that more closely matches the physical architecture of existing clusters. Using a synthetic application, we explore the properties of the system and evaluate its performance under multiple configurations, with varying worker reliability, network capabilities, and data requirements. We show that by matching the software and hardware architectures more closely, we can gain both a modest improvement in runtime and a dramatic reduction in network footprint at the master. We then run a scalable molecular dynamics application (AWE) to examine the impact of hierarchy on performance, cost, and efficiency for real scientific applications and see a 96% reduction in network footprint. This reduction makes the application much more palatable to system operators and opens the possibility of increasing the application scale by another order of magnitude or more.
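
To make the intuition behind the reduced network footprint at the master concrete, the following is a minimal back-of-the-envelope sketch, not the Work Queue implementation. It assumes that every worker needs the same shared input data, that in a flat configuration the master sends that data to each worker directly, and that in a hierarchical configuration the master sends it once per cluster to a hypothetical intermediate node that relays it locally. The function names and the numbers are illustrative assumptions and are unrelated to the measurements reported above.

```python
# Toy model of traffic leaving the master (assumptions only; not the
# Work Queue API and not the measured results).


def master_traffic_flat(num_workers: int, shared_input_mb: int) -> int:
    """Flat configuration: the master sends the shared input to every worker."""
    return num_workers * shared_input_mb


def master_traffic_hierarchical(num_clusters: int, shared_input_mb: int) -> int:
    """Hierarchical configuration (assumed): the master sends the shared input
    once per cluster; an intermediate node inside the cluster relays it to the
    local workers, so that traffic never touches the master's link."""
    return num_clusters * shared_input_mb


def footprint_reduction(flat_mb: int, hierarchical_mb: int) -> float:
    """Fraction of master traffic removed by the hierarchy."""
    return (flat_mb - hierarchical_mb) / flat_mb


if __name__ == "__main__":
    # Hypothetical scenario: 200 workers spread over 10 clusters,
    # each worker needing the same 100 MB of input.
    flat = master_traffic_flat(num_workers=200, shared_input_mb=100)
    hier = master_traffic_hierarchical(num_clusters=10, shared_input_mb=100)
    print(f"flat: {flat} MB, hierarchical: {hier} MB")
    print(f"reduction at master: {footprint_reduction(flat, hier):.0%}")
```

Under these assumptions, the traffic at the master grows with the number of clusters rather than the number of workers, which is why the benefit would be expected to increase as more workers are added per cluster. Per-task data that is unique to each worker is ignored here and would narrow the gap.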