StaTIX — Statistical Type Inference on Linked Data

Large knowledge bases typically contain data adhering to various schemas, with incomplete and/or noisy type information. This seriously complicates further integration and post-processing efforts, as type information is crucial to handling the data correctly. In this paper, we introduce a novel statistical type inference method, called StaTIX, to infer instance types in Linked Data sets in a fully unsupervised manner. Our inference technique leverages a new hierarchical clustering algorithm that is robust, highly effective, and scalable. We introduce a novel approach to reduce the processing complexity of the similarity matrix that specifies the relations between instances in the knowledge base. This approach speeds up the inference process and also improves the correctness of the inferred types by attenuating noise in the input data. We further optimize the clustering process with a dedicated hash function that speeds up inference by orders of magnitude without negatively affecting its accuracy. Finally, we describe a new technique to identify representative clusters in the multi-scale output of our clustering algorithm, which further improves the accuracy of the inferred types. We empirically evaluate our approach on several real-world datasets and compare it to the state of the art. Our results show that StaTIX is both more efficient than existing methods (in terms of speed and memory consumption) and more effective. Compared to the state of the art, StaTIX reduces the F1-score error of the predicted types by about 40% on average and improves the execution time by orders of magnitude.
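To make the reported reduction in F1-score error concrete, the following minimal sketch works through the arithmetic. It assumes that "F1-score error" means 1 - F1, which the abstract does not state explicitly; the function name and the baseline value of 0.70 are hypothetical and chosen only for illustration, not taken from the paper's results.

```python
def f1_after_error_reduction(baseline_f1: float, relative_reduction: float) -> float:
    """Return the F1-score obtained when the error (assumed to be 1 - F1)
    shrinks by the given relative fraction."""
    if not 0.0 <= baseline_f1 <= 1.0:
        raise ValueError("baseline_f1 must lie in [0, 1]")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must lie in [0, 1]")
    error = 1.0 - baseline_f1                  # assumed definition of F1-score error
    reduced_error = error * (1.0 - relative_reduction)
    return 1.0 - reduced_error


# Hypothetical baseline F1 of 0.70 (illustrative, not a reported result).
baseline = 0.70
improved = f1_after_error_reduction(baseline, 0.40)
print(f"baseline F1 = {baseline:.2f}, F1 after a 40% error reduction = {improved:.2f}")
```

Under this reading, a 40% reduction is relative to the remaining error rather than an absolute gain in F1, so the same relative reduction yields a smaller absolute improvement when the baseline is already high.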
