Flexible data integration and curation using a graph-based approach

MOTIVATION: The increasing diversity of data available to the biomedical scientist holds promise for a better understanding of diseases and the discovery of new treatments for patients. To provide a complete picture of a biomedical question, data from many different origins need to be combined into a unified representation. During this data integration process, the errors and ambiguities inevitably present in the initial sources compromise the quality of the resulting data warehouse and greatly diminish the scientific value of its content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it is becoming increasingly difficult to dedicate and optimize resources for data integration projects, as available repositories are growing in both size and number every day.

RESULTS: We present a new generic methodology to identify problematic records, which cause what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in the social sciences: graph density and betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses.

CONTACT: samuel.croset@roche.com

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
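
The two metrics named in the abstract can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes records are nodes and cross-references are undirected edges, computes graph density as 2m / (n(n - 1)), and computes unnormalized betweenness centrality with Brandes' algorithm for unweighted graphs. The record identifiers (a1, b1, x, ...) and the function names are hypothetical, and the abstract does not state how the metrics are thresholded, so no cut-off is applied here.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def graph_density(adj):
    """Density of an undirected simple graph: 2m / (n(n - 1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def betweenness_centrality(adj):
    """Unnormalized betweenness (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints; halve the totals.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical toy data: two well-linked clusters of records (a*, b*)
# joined only through one ambiguous record "x".
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a1", "x"), ("x", "b1"),
]
adjacency = build_adjacency(edges)
density = graph_density(adjacency)
centrality = betweenness_centrality(adjacency)
suspect = max(centrality, key=centrality.get)
print(f"density={density:.3f}, most central record={suspect}")
```

In this toy graph the bridging record lies on every shortest path between the two clusters, so it receives the highest betweenness score. A record of that kind would be a natural candidate for curator review, under the assumption that such bridges correspond to the problematic records the abstract describes.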
