Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data

Many repositories of open data for genomics, collected by worldwide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. Biologists typically use these datasets to validate or enrich their own experiments, and the content of each dataset is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation: metadata are not standardized across sources and are often unstructured and incomplete.
