A Method to Identify Relevant Genome Data: Conceptual Modeling for the Medicine of Precision

Techniques such as Next Generation Sequencing increase our knowledge of the genomic risk of developing a given disease, improving our ability to provide an early diagnosis and thus an appropriate treatment for each patient. To reach an accurate diagnosis, clinicians must search the open data repositories available to the research community. However, information about gene-disease associations is spread across a vast number of heterogeneous, dispersed data sources of variable quality, which makes it hard to determine whether the variants found in the DNA sequence of a patient’s sample are clinically relevant. In this paper, we present a systematic method, based on conceptual modeling and data quality management techniques, that tackles these issues in order to support the genomic diagnosis of a disease. To this end, we first state the most prominent problems affecting open data repositories for genomics. We then apply a methodological approach to identify what we call “smart data”: the relevant information hidden in the genomics data lake. Finally, to test and validate the proposed method, we apply it to a case study based on the clinical diagnosis of Crohn’s Disease.
