Smart Data for Genomic Information Systems: the SILE Method

Over the last two decades, data generated by Next Generation Sequencing technologies have revolutionized our understanding of human biology and improved the study of how changes (variations) in DNA affect the risk of suffering a certain disease. A huge amount of genomic data is publicly available and is frequently used by the research community to extract meaningful and reliable gene-disease relationships. However, managing this exponential growth of data has become a challenge for biologists. From this Big Data perspective, they are forced to delve into a lake of complex data spread across more than a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality. Yet when data are used to solve a concrete problem, only a small part of that “data lake” is really significant; this is what we call the “smart” data perspective. Using conceptual models and the principles of data quality management, adapted to the genomic domain, we propose a systematic approach, called the SILE method, to move from a Big Data to a Smart Data perspective. The aim of this approach is to populate an Information System with genomic data that are accessible, informative and actionable enough to extract valuable knowledge.
