Democratic databases: science on GitHub

Scientists are turning to a software-development site to share data and code.

When the Ebola outbreak in West Africa picked up pace in July 2014, Caitlin Rivers started to collect data on the people affected. Rivers, then a PhD student in computational epidemiology, wanted to model the outbreak’s spread. So every day she downloaded PDF updates released by the ministries of health of the virus-stricken countries, and converted the numbers into computer-readable tables. Rather than keeping these files to herself, she posted them to GitHub.com, a hugely popular website for collaborative work on software code. Rivers thought the postings might attract those interested in up-to-date information from the Ebola outbreak. “I figured if I needed it, other people would, too,” she says.

Rivers was right. Other researchers began to download the data and contribute to the project. On some days, third parties would download and convert the ministries’ data before her, and load them into the GitHub repository. Others created programming scripts to do simple error-checks on the data, such as ensuring that the daily patient counts made sense. At the time, GitHub was “really the only place on the Internet that you could interact with these data as data, and not as a PDF”, says Rivers, who was at Virginia Polytechnic Institute and State University in Blacksburg when she began the project, and is now an epidemiologist at the US Army Public Health Center in Edgewood, Maryland.

Launched in 2008 to assist software developers, GitHub now boasts some 15 million users and is an increasingly popular site for researchers to share, maintain and update scientific data sets and code (see ‘Growing influence of GitHub’). GitHub is “the biggest revelation in my workflow ... since I started writing code”, says Daniel Falster, a postdoctoral researcher in ecology at Macquarie University in Sydney, Australia. “When we started using GitHub, it was just amazing. We now use it in everything that we do.”

Falster’s Biomass and Allometry Database, which aggregates various measures of plant size from 176 studies, is stored on the site. So is the Open Tree of Life project, which aims to compile different published phylogenies to build one master ‘tree of life’. It uses GitHub to store data files and publication records, and to accept new data sets from third parties.

Plenty of websites are dedicated to sharing data. But GitHub is specifically designed for transparent, open collaboration because it