Wrangling distributed computing for high-throughput environmental science: An introduction to HTCondor

Biologists and environmental scientists now routinely solve computational problems that were unimaginable a generation ago. Examples include processing geospatial data, analyzing -omics data, and running large-scale simulations. Conventional desktop computing cannot handle these tasks when they are large, and high-performance computing is neither always available nor always the most appropriate solution for computationally intense problems. High-throughput computing (HTC) is one method for handling computationally intense research. In contrast to high-performance computing, which uses a single "supercomputer," HTC distributes tasks over many computers (e.g., idle desktop computers, dedicated servers, or cloud-based resources). HTC facilities exist at many academic and government institutes and are relatively easy to create from commodity hardware. Additionally, consortia such as the Open Science Grid facilitate HTC, and commercial entities sell cloud-based solutions to researchers who lack HTC at their institution. We provide an introduction to HTC for biologists and environmental scientists. Our examples from biology and the environmental sciences use HTCondor, an open-source HTC system.
