Scaling Watershed Models: Modern Approaches to Science Computation with MapReduce, Parallelization, and Cloud Optimization

Environmental models are products of the computer architecture and software tools available at the time of their development. Scientifically sound algorithms often persist in their original form even as system architectures and software development approaches evolve. Since the 1980s, the EPA has developed algorithms to estimate the flux of pesticides from treated fields to neighboring water bodies. Recent development of the EPA’s Spatial Aquatic Model (SAM) has provided an opportunity to redevelop, optimize, and modernize this code, which is used for regulatory decisions. Profiling indicated a number of efficiencies that could be gained by updating the code to address execution time, memory utilization, CPU utilization, and disk I/O issues. Porting the code to Python to access modern scientific computing and database libraries has enabled a number of improvements and new use cases for SAM, including better scalability, deployment on cloud infrastructure, simpler cross-language communication, and use of NoSQL databases. Together, these changes allow SAM to be run as a service despite its intensive input data, processing, and large output data requirements. Treating individual watersheds concurrently as embarrassingly parallel processes increases efficiency and scalability, and MapReduce methods speed up post-processing of the model outputs while accounting for the network structure of watersheds. Converting to Python has also allowed the development process to leverage modern software testing frameworks and continuous integration techniques. We discuss the experience of modernizing this code base with the goal of communicating useful design patterns for other science models.
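
The pattern of per-watershed parallelism followed by MapReduce-style post-processing can be illustrated with a minimal Python sketch. This is not SAM's code: the four-watershed network, the applied masses, the runoff fraction, and every function name below are hypothetical placeholders chosen for illustration. The sketch uses a thread pool from the standard library to keep the example self-contained; for CPU-bound model runs, a process pool or a cluster scheduler would be the more likely choice.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical toy network: watershed id -> id of the watershed it drains into.
DOWNSTREAM = {"A": "C", "B": "C", "C": "D", "D": None}

# Hypothetical pesticide mass applied in each watershed (arbitrary units).
APPLIED = {"A": 10.0, "B": 4.0, "C": 6.0, "D": 0.0}

# Placeholder for the real field-to-water transport calculation.
RUNOFF_FRACTION = 0.5


def local_load(watershed_id):
    """Compute one watershed's own load; needs no data from other watersheds."""
    return watershed_id, APPLIED[watershed_id] * RUNOFF_FRACTION


def run_all(watershed_ids, max_workers=4):
    """Run every watershed independently (the embarrassingly parallel step)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(local_load, watershed_ids))


def map_to_receivers(watershed_id, load):
    """Map step: emit (receiver, load) for the watershed and all downstream ones."""
    node = watershed_id
    while node is not None:
        yield node, load
        node = DOWNSTREAM[node]


def accumulate(local_loads):
    """Shuffle by receiving watershed, then reduce each group by summation."""
    grouped = defaultdict(list)
    for watershed_id, load in local_loads.items():
        for receiver, value in map_to_receivers(watershed_id, load):
            grouped[receiver].append(value)
    return {r: reduce(lambda a, b: a + b, values) for r, values in grouped.items()}


if __name__ == "__main__":
    local = run_all(list(DOWNSTREAM))
    print(accumulate(local))
```

In this toy network, watersheds A and B drain into C, which drains into D, so the accumulated load at C is the sum of the local loads of A, B, and C. Because the map step only emits key-value pairs and the reduce step is a plain sum, both could in principle be distributed across workers without changing the per-watershed computation.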