Classification of Linked Data Sources Using Semantic Scoring

Linked data sets are created using Semantic Web technologies. They are usually large, and their number is growing. Query execution over them is therefore costly, and knowing the content of such data sets should help in targeted querying. Our aim in this paper is to classify linked data sets by their knowledge content. Earlier projects such as LOD Cloud, LODStats, and SPARQLES analyze linked data sources in terms of content, availability, and infrastructure, and classify and tag them principally using the VoID vocabulary. Although all linked data sources listed in these projects appear to be classified or tagged, there are only a limited number of studies on automated tagging and classification of newly arriving linked data sets. Here, we focus on automated classification of linked data sets using semantic scoring methods. We collected the SPARQL endpoints of 1,328 unique linked data sets from the Datahub, LOD Cloud, LODStats, SPARQLES, and SpEnD projects. We then queried the textual descriptions of resources in these data sets using their rdfs:comment and rdfs:label property values. We analyzed these texts with document analysis techniques, treating every SPARQL endpoint as a separate document. To this end, we used the WordNet semantic relations library combined with an adapted term frequency-inverse document frequency (tf-idf) analysis on the words and their semantic neighbours. Using the WordNet database, we extracted information about comment/label objects in linked data sources through the hypernym, hyponym, homonym, meronym, region, topic, and usage semantic relations. We obtained significant results for the hypernym and topic relations: we can find words that identify data sets, and these words can be used in automatic classification and tagging of linked data sources. Using these words, we experimented with different classifiers and different scoring methods, which yielded better classification accuracy.

Keywords: linked data, semantic classification, WordNet
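
The document-analysis step described in the abstract, where each SPARQL endpoint is treated as one document built from its rdfs:label and rdfs:comment values, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses plain tf-idf rather than the adapted variant (whose details are not given here), it omits the WordNet expansion, and the query text, its LIMIT, the endpoint URLs, the sample strings, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical query for collecting label/comment literals from one endpoint.
# It is shown for illustration only and is not executed in this sketch.
LABEL_COMMENT_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?text WHERE {
  { ?s rdfs:label ?text } UNION { ?s rdfs:comment ?text }
} LIMIT 1000
"""


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(endpoint_texts):
    """Compute plain tf-idf, treating each endpoint as one document.

    endpoint_texts maps an endpoint URL to a list of label/comment strings.
    Returns a mapping endpoint -> {word: score}.
    """
    counts = {
        ep: Counter(w for t in texts for w in tokenize(t))
        for ep, texts in endpoint_texts.items()
    }
    n_docs = len(counts)
    # Document frequency: in how many endpoints does each word occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for ep, c in counts.items():
        total = sum(c.values())
        scores[ep] = {
            w: (n / total) * math.log(n_docs / doc_freq[w])
            for w, n in c.items()
        }
    return scores


def top_terms(scores, endpoint, k=3):
    """Return the k highest-scoring words for one endpoint (ties by word)."""
    ranked = sorted(scores[endpoint].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Invented toy data standing in for query results from two endpoints.
toy = {
    "http://example.org/bio/sparql": [
        "protein binding site",
        "gene encodes a protein",
    ],
    "http://example.org/geo/sparql": [
        "river in a country",
        "capital city of a country",
    ],
}

scores = tfidf(toy)
for endpoint in toy:
    print(endpoint, top_terms(scores, endpoint))
```

Words that occur in every endpoint receive a score of zero under this weighting, so the highest-ranked words are those that distinguish one data set from the others. Expanding each word with its WordNet neighbours (for example hypernyms) before counting would be a separate step that this sketch leaves out.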
