Towards a Data Science Collaboratory

Data-driven research requires many people from different domains to collaborate efficiently. The domain scientist collects and analyzes scientific data, the data scientist develops new techniques, and the tool developer implements, optimizes and maintains existing techniques for use throughout science and industry. Today, however, this data science expertise is fragmented across loosely connected communities and scattered over many people, making it very hard to find the right expertise, data and tools at the right time. Collaborations are typically small, and cross-domain knowledge transfer through the literature is slow. Although progress has been made, it remains far from easy for one group to build on the latest results of another and to collaborate effortlessly across domains. This slows down data-driven research and innovation, drives up costs and exacerbates the risks associated with the inappropriate use of data science techniques.

We propose to create an open, online collaboration platform, a ‘collaboratory’ for data-driven research, that brings together data scientists, domain scientists and tool developers. It will enable data scientists to evaluate their latest techniques on many current scientific datasets, allow domain scientists to discover which techniques work best on their data, and engage tool developers in the latest developments. It will change the scale of collaborations from small to potentially massive, and from periodic to real-time. This will be an inclusive movement operating across academia, healthcare and industry, and it will empower more students to engage in data science.

Fig. 1. Roles within the data science ecosystem and the gaps between them.
