Empirical Analysis of Ranking Models for an Adaptable Dataset Search

Currently available datasets still have a large unexplored potential for interlinking. Ranking techniques contribute to this task by scoring datasets according to the likelihood of finding entities related to those of a target dataset. Ranked datasets can either be selected manually for standalone link discovery tasks or be inspected automatically by programs that traverse the ranking looking for entity links. This work presents empirical comparisons between different ranking models and argues that different algorithms could be used depending on whether the ranking is handled manually or automatically and on the metadata available for the datasets. Experiments indicate that the ranking algorithms that perform best with respect to nDCG do not always achieve the best recall at position k (recall@k) at high recall levels. To reach 90% recall, the best ranking model for the manual use case (with respect to nDCG) may need almost 47% of the ranking, whereas the best model for the automatic use case (with respect to recall@k) needs only the top 34%, a difference of 13 percentage points.
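The gap between nDCG and recall@k can be illustrated with a minimal sketch. The two toy rankings below are invented for illustration and are not data from the experiments; relevance is assumed to be binary (1 if a dataset contains entities linkable to the target dataset, 0 otherwise), and the function names are hypothetical.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain with the common log2(rank + 1) discount.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # DCG normalized by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def recall_at_k(relevances, k):
    # Share of all relevant datasets found in the top k positions.
    total = sum(1 for r in relevances if r > 0)
    if total == 0:
        return 0.0
    return sum(1 for r in relevances[:k] if r > 0) / total

def fraction_for_recall(relevances, target):
    # Smallest top slice of the ranking (as a fraction) reaching the target recall.
    for k in range(1, len(relevances) + 1):
        if recall_at_k(relevances, k) >= target:
            return k / len(relevances)
    return None

# Toy rankings over the same 10 datasets, 4 of which are relevant (assumed data).
ranking_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # strong top, one relevant dataset at the end
ranking_b = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # weaker top, all relevant datasets in the top 5

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(name, round(ndcg(ranking), 3), fraction_for_recall(ranking, 0.9))
```

In this toy case ranking A has the higher nDCG, which suits a user who inspects only the top positions, while ranking B reaches 90% recall after half of the ranking instead of all of it, which suits a program that traverses the ranking until enough links are found.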
