Sustainable Software Ecosystems for Open Science

Mathematics is the core language of science, and for centuries researchers were expected to show the mathematical underpinnings of new work as part of any scientific exploration. This lingua franca provided an essential level of understandability and precision, allowing unambiguous communication and rigorous verification of scientific claims beyond the inaccuracies of spoken languages. However, the last few decades have seen an erosion of this paradigm. The increased reliance of science on complex computational codes and large data sets makes the description of all but the most basic research error-prone, impenetrable, and unverifiable. This issue is not restricted to one field of science; it is endemic throughout the broader scientific community, and the consequences of opaque processes and a lack of reproducibility are not trivial. Cases of irreproducible studies and clinical trials have been making headlines, from Bayer HealthCare stopping nearly two-thirds of its target-validation projects because of inconsistencies with the initially published claims, to global economic policy being based on a single fundamentally flawed study by Harvard economists. Such costly mistakes can be caught much earlier, before key decisions are made, simply by returning transparency and precision to the process of publication and review. The question we must address is how best to reinstate a common language, and what that language should be. We believe that the only practical choice is to require that disclosures of scientific research based on complex codes and data use those very same codes and data as the common language of publication. This means that as new studies and new scientific explorations are undertaken, the data, methods, and software used by the researchers to arrive at their conclusions must be made available and accessible to other researchers and to the general public.
If this goal is to be realized, the standard of software engineering in science must be improved, and sustainable software ecosystems that give meaningful credit to their contributors must be established. This is not simply a matter of teaching scientists to write code; if sustainable software projects are to take root in science, then issues such as testing, licensing, and collaboration must also be addressed. Sufficient engineering discipline is required to build robust foundations that can be extended through code review, regression testing, and proper citation and attribution of the software used in research.