The State of Software for Evolutionary Biology

Abstract With Next Generation Sequencing data being routinely used, evolutionary biology is transforming into a computational science. Thus, researchers have to rely on a growing number of increasingly complex software. All widely used core tools in the field have grown considerably, in terms of the number of features as well as lines of code and consequently, also with respect to software complexity. A topic that has received little attention is the software engineering quality of widely used core analysis tools. Software developers appear to rarely assess the quality of their code, and this can have potential negative consequences for end‐users. To this end, we assessed the code quality of 16 highly cited and compute‐intensive tools mainly written in C/C++ (e.g., MrBayes, MAFFT, SweepFinder, etc.) and JAVA (BEAST) from the broader area of evolutionary biology that are being routinely used in current data analysis pipelines. Because, the software engineering quality of the tools we analyzed is rather unsatisfying, we provide a list of best practices for improving the quality of existing tools and list techniques that can be deployed for developing reliable, high quality scientific software from scratch. Finally, we also discuss journal as well as science policy and, more importantly, funding issues that need to be addressed for improving software engineering quality as well as ensuring support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software engineering quality issues and to emphasize the substantial lack of funding for scientific software development.

[1]  Valmir C. Barbosa,et al.  On best practices in the development of bioinformatics software , 2014, Front. Genet..

[2]  Thomas K. F. Wong,et al.  Phylogenomics resolves the timing and pattern of insect evolution , 2014, Science.

[3]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[4]  Alexey M. Kozlov,et al.  ExaML version 3: a tool for phylogenomic analyses on supercomputers , 2015, Bioinform..

[5]  Huai Liu,et al.  An innovative approach for testing bioinformatics programs using metamorphic testing , 2009, BMC Bioinformatics.

[6]  Ziheng Yang PAML 4: phylogenetic analysis by maximum likelihood. , 2007, Molecular biology and evolution.

[7]  Murray Hill,et al.  Lint, a C Program Checker , 1978 .

[8]  A. Rambaut,et al.  BEAST: Bayesian evolutionary analysis by sampling trees , 2007, BMC Evolutionary Biology.

[9]  Lionel C. Briand,et al.  Investigating quality factors in object-oriented designs: an industrial case study , 1999, Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002).

[10]  Tsong Yueh Chen,et al.  Metamorphic Testing: A New Approach for Generating Next Test Cases , 2020, ArXiv.

[11]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[12]  J. Huelsenbeck,et al.  The fossilized birth–death process for coherent calibration of divergence-time estimates , 2013, Proceedings of the National Academy of Sciences.

[13]  Kristian Rother,et al.  A toolbox for developing bioinformatics software , 2012, Briefings Bioinform..

[14]  V. Springel The Cosmological simulation code GADGET-2 , 2005, astro-ph/0505010.

[15]  Yuanyuan Zhou,et al.  Have things changed now?: an empirical study of bug characteristics in modern open source software , 2006, ASID '06.

[16]  Tsong Yueh Chen,et al.  How to test bioinformatics software? , 2015, Biophysical Reviews.

[17]  Peter B. Ladkin Causal Reasoning about Aircraft Accidents , 2000, SAFECOMP.

[18]  Alexandros Stamatakis,et al.  Two C++ libraries for counting trees on a phylogenetic terrace , 2018, Bioinform..

[19]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[20]  O. Gascuel,et al.  New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. , 2010, Systematic biology.

[21]  Yusaku Yamamoto,et al.  Roundoff error analysis of the Cholesky QR2 algorithm , 2015 .

[22]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[23]  Alexandros Stamatakis,et al.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies , 2014, Bioinform..

[24]  David Clark,et al.  Safety and Security Analysis of Object-Oriented Models , 2002, SAFECOMP.

[25]  Yuanyuan Zhou,et al.  BugBench: Benchmarks for Evaluating Bug Detection Tools , 2005 .

[26]  C. A. R. Hoare,et al.  An axiomatic basis for computer programming , 1969, CACM.

[27]  B. Rannala,et al.  Bayesian species delimitation using multilocus sequence data , 2010, Proceedings of the National Academy of Sciences.

[28]  N. Abdelmalek Round off error analysis for Gram-Schmidt method and solution of linear least squares problems , 1971 .

[29]  Premkumar T. Devanbu,et al.  Assert Use in GitHub Projects , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[30]  Lionel C. Briand,et al.  Exploring the relationships between design measures and software quality in object-oriented systems , 2000, J. Syst. Softw..

[31]  Carlos Bustamante,et al.  Genomic scans for selective sweeps using SNP data. , 2005, Genome research.

[32]  Glenford J. Myers,et al.  Art of Software Testing , 1979 .

[33]  Alexandros Stamatakis,et al.  A Nuclear Ribosomal DNA Phylogeny of Acer Inferred with Maximum Likelihood, Splits Graphs, and Motif Analysis of 606 Sequences , 2006, Evolutionary bioinformatics online.

[34]  M. Holder,et al.  Hastings ratio of the LOCAL proposal used in Bayesian phylogenetics. , 2005, Systematic biology.

[35]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[36]  Jiajie Zhang,et al.  PEAR: a fast and accurate Illumina Paired-End reAd mergeR , 2013, Bioinform..

[37]  Ziheng Yang,et al.  INDELible: A Flexible Simulator of Biological Sequence Evolution , 2009, Molecular biology and evolution.

[38]  Richard R. Hudson,et al.  Generating samples under a Wright-Fisher neutral model of genetic variation , 2002, Bioinform..

[39]  K. Katoh,et al.  MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability , 2013, Molecular biology and evolution.

[40]  Md. Shamsuzzoha Bayzid,et al.  Whole-genome analyses resolve early branches in the tree of life of modern birds , 2014, Science.

[41]  Lex Nederbragt,et al.  Good enough practices in scientific computing , 2016, PLoS Comput. Biol..

[42]  Eleni Giannoulatou,et al.  Verification and validation of bioinformatics software without a gold standard: a case study of BWA and Bowtie , 2014, BMC Bioinformatics.

[43]  Maxim Teslenko,et al.  MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space , 2012, Systematic biology.

[44]  Andrew Rambaut,et al.  Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees , 1997, Comput. Appl. Biosci..

[45]  Ian M. Mitchell,et al.  Best Practices for Scientific Computing , 2012, PLoS biology.

[46]  Lauretta O. Osho,et al.  Axiomatic Basis for Computer Programming , 2013 .

[47]  Joel Dudley,et al.  Bioinformatics software for biologists in the genomics era , 2007, Bioinform..

[48]  Ari Löytynoja,et al.  An algorithm for progressive multiple alignment of sequences with insertions. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[49]  Julia L. Lawall,et al.  Finding Error Handling Bugs in OpenSSL Using Coccinelle , 2010, 2010 European Dependable Computing Conference.

[50]  D. Robinson,et al.  Comparison of phylogenetic trees , 1981 .

[51]  Jason Williams,et al.  Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators , 2017, bioRxiv.

[52]  Elmar Jürgens,et al.  Do code clones matter? , 2009, 2009 IEEE 31st International Conference on Software Engineering.