A Ranking-Based Approach to Discover Semantic Associations Between Linked Data

Under the umbrella of the Semantic Web, Linked Data projects have the potential to discover links between datasets and to make a large number of semantically interconnected data available. Health Care and Life Sciences in particular have taken advantage of this research area, and publicly available, highly connected data about disorders and disease genes, drugs, and clinical trials are accessible on the Web. In addition, existing health care domain ontologies usually comprise large sets of facts, which have been used to annotate scientific data. For instance, annotations from controlled vocabularies such as MeSH or UMLS describe the topics treated in PubMed publications, and these annotations have been successfully used to discover associations between drugs and diseases in the context of Literature-Based Discovery. However, given the size of the linked datasets, users may have to spend many hours or even days traversing links before identifying a new discovery. In this paper we provide an authority-flow based ranking technique that assigns high scores to terms that correspond to potential novel discoveries and efficiently identifies these highly scored terms. We propose a graph-sampling method that models linked data as a Bayesian network and implements a Direct Sampling reasoning algorithm to approximate the ranking scores of the network. An initial experimental study reveals that our ranking techniques are able to reproduce state-of-the-art discoveries; additionally, the sampling-based approach reduces evaluation time relative to the exact solution.
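The idea of approximating authority-flow scores by sampling can be illustrated with a minimal sketch. It is not the paper's implementation: the toy graph, the node names, the even split of authority over outgoing links, and the function names `exact_scores` and `sampled_scores` are all assumptions made for illustration. The exact version pushes one unit of authority from a start node through a layered, acyclic graph; the sampled version draws random walks in the forward direction (each node is chosen given its predecessor, in the spirit of direct sampling) and uses visit frequencies as score estimates.

```python
import random
from collections import defaultdict

# Hypothetical toy layered graph: a start topic links to publications,
# which link to candidate terms. Names and structure are illustrative only.
EDGES = {
    "start": ["pub1", "pub2", "pub3"],
    "pub1": ["termA", "termB"],
    "pub2": ["termA"],
    "pub3": ["termB", "termC"],
}

def exact_scores(edges, source):
    """Propagate one unit of authority from source through an acyclic graph.

    Assumption: each node splits its authority evenly over its out-edges.
    Authority that reaches a node without out-edges is its final score.
    """
    scores = defaultdict(float)
    frontier = {source: 1.0}
    while frontier:
        next_frontier = defaultdict(float)
        for node, mass in frontier.items():
            targets = edges.get(node, [])
            if not targets:
                scores[node] += mass
                continue
            for target in targets:
                next_frontier[target] += mass / len(targets)
        frontier = next_frontier
    return dict(scores)

def sampled_scores(edges, source, n_samples, seed=0):
    """Estimate the same scores from forward random walks.

    Each walk picks one outgoing edge uniformly at random until it reaches
    a node without out-edges; the fraction of walks ending at a node
    estimates that node's score.
    """
    rng = random.Random(seed)
    hits = defaultdict(int)
    for _ in range(n_samples):
        node = source
        while edges.get(node):
            node = rng.choice(edges[node])
        hits[node] += 1
    return {node: count / n_samples for node, count in hits.items()}

if __name__ == "__main__":
    exact = exact_scores(EDGES, "start")
    approx = sampled_scores(EDGES, "start", n_samples=20000)
    for term in sorted(exact, key=exact.get, reverse=True):
        print(term, round(exact[term], 3), round(approx.get(term, 0.0), 3))
```

On a graph this small the exact computation is trivially cheap; the sampling variant only becomes interesting when the number of paths grows too large to enumerate, which is the situation the abstract describes.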
