Position Paper on Dataset Engineering to Accelerate Science

Data is a critical element in any discovery process. In recent decades, we have observed exponential growth in the volume of available data and in the technology to manipulate it. However, data is only useful when it can be structured for a well-defined task. For instance, training a natural language machine-learning model requires a corpus of text broken into sentences. In this work, we use the term "dataset" to designate a structured set of data built to perform a well-defined task. Moreover, in most cases we treat a dataset as a blueprint of an entity that can be stored as a table at any moment. In science specifically, each field has its own ways of organizing, gathering, and handling its datasets. We believe that datasets must be first-class entities in any knowledge-intensive process, and that all workflows should pay special attention to the dataset lifecycle, from gathering to use and evolution. We argue that discovery processes in science and engineering are extreme instances of the need for such organization of datasets, calling for new approaches and tooling. These requirements become even more evident when the discovery workflow uses artificial intelligence methods to empower the subject-matter expert. In this work, we discuss an approach to making datasets a critical entity in the scientific discovery process. We illustrate some concepts using materials discovery as a use case. We chose this domain because it raises many significant problems that generalize to other fields of science.
