Link prediction of datasets sameAS interlinking network on web of data

To be considered Linked Data, datasets on the web must be linked to other datasets. Current studies on dataset interlinking prediction do not distinguish between link types, which limits their usefulness in real application scenarios: dataset publishers still do not know what kinds of RDF links can be established, nor how to configure their data linking algorithms. In this paper, we focus on predicting possible links between datasets for the most important RDF link type, owl:sameAs. Since the goal is to discriminate linked dataset pairs from unlinked ones, we formulate link prediction as a classification problem. We adopt Random Forest as the base classifier, using the scores output by unsupervised predictors as features, and apply bagging to combine multiple forests, which reduces variance and improves accuracy. Experiments show that this approach improves prediction performance by about 10% in AUROC.
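As a rough illustration of the feature construction step, the sketch below computes a few common unsupervised link prediction scores for each unlinked dataset pair in a toy network. The network, the dataset names, the helper functions, and the choice of these four predictors are all assumptions made for illustration; the abstract does not say which predictors the paper uses. Each resulting score vector would serve as one input row for a classifier such as a bagged Random Forest.

```python
import math
from itertools import combinations

# Hypothetical toy network: nodes are datasets, edges are existing
# owl:sameAs interlinks. Names and edges are invented for illustration.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pair_features(adj, u, v):
    """Score one dataset pair with four neighborhood-based predictors."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: shared neighbors with few links count for more.
        # Degree-1 nodes are skipped to avoid dividing by log(1) = 0.
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1
        ),
        "preferential_attachment": len(nu) * len(nv),
    }


def candidate_pairs(adj):
    """All dataset pairs that are not yet linked."""
    return [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]


adj = build_adjacency(EDGES)
features = {pair: pair_features(adj, *pair) for pair in candidate_pairs(adj)}

for pair, scores in features.items():
    print(pair, scores)
```

In a supervised setting, the same features would also be computed for pairs that are already linked, so that the classifier sees both positive and negative examples.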
