WDCloud: An end to end system for large-scale watershed delineation on cloud

Watershed delineation is the process of computing the drainage area for a point on the land surface, and it is a critical step in hydrologic and water resources analysis. However, existing watershed delineation tools do not adequately support hydrologists and watershed researchers because they lack essential capabilities, such as fully leveraging scalable, high-performance computing infrastructure (the public cloud) and providing predictable performance for delineation tasks. To address these problems, this paper presents WDCloud, a system for large-scale watershed delineation on the public cloud. The design and implementation of WDCloud rest on three main approaches: 1) an automated catchment search mechanism for a public data set, 2) three performance improvement strategies (data-reuse, parallel-union, and MapReduce), and 3) a local linear regression-based execution time estimator for watershed delineation. Moreover, WDCloud makes extensive use of compute and storage services from Amazon Web Services to maximize the performance, scalability, and elasticity of the watershed delineation system. Our evaluation focuses on two main aspects of WDCloud: the performance improvement for watershed delineation achieved by the three strategies, and the accuracy of the local linear regression in estimating watershed delineation time. The results show that WDCloud achieves speed-ups of 18x to 111x for delineating watersheds of any scale in the contiguous United States, compared with commodity laptop environments, and predicts the execution time of watershed delineation with 85.6% accuracy, which is 13% to 23% higher than other state-of-the-art approaches.
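To make the idea of a local linear regression-based execution time estimator concrete, the following is a minimal sketch and not WDCloud's actual implementation. It assumes a single hypothetical feature (for example, the number of catchments in a watershed), a Gaussian kernel, and a hand-picked bandwidth; none of these choices are stated in the abstract. The function name `local_linear_predict` is illustrative only.

```python
import numpy as np

def local_linear_predict(x_train, y_train, x_query, bandwidth):
    """Predict y at x_query by fitting a weighted line around x_query.

    x_train: past feature values (hypothetical: catchment counts)
    y_train: observed execution times for those past runs
    bandwidth: kernel width (an assumed tuning parameter)
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    # Gaussian kernel: past runs close to the query get larger weights.
    weights = np.exp(-0.5 * ((x_train - x_query) / bandwidth) ** 2)

    # Design matrix centered at the query, so the fitted intercept
    # is the prediction at x_query.
    design = np.column_stack([np.ones_like(x_train), x_train - x_query])

    # Weighted least squares via row scaling by sqrt(weight).
    sqrt_w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * sqrt_w[:, None],
                               y_train * sqrt_w, rcond=None)
    return float(beta[0])

# Synthetic, exactly linear history (illustrative data, not measurements).
history_x = [1.0, 2.0, 3.0, 4.0, 5.0]
history_y = [2.0 * x + 1.0 for x in history_x]
estimate = local_linear_predict(history_x, history_y, 3.5, bandwidth=1.0)
```

Because each prediction refits a line using only nearby historical runs, the estimator can follow a nonlinear relationship between watershed size and execution time without assuming one global model.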
