Learning classifiers from remote RDF data stores augmented with RDFS subclass hierarchies

Rapid growth of RDF data in the Linked Open Data (LOD) cloud offers unprecedented opportunities for analyzing such data using machine learning algorithms. The massive size and distributed nature of the LOD cloud present a challenging machine learning problem in which the data can only be accessed remotely, i.e., through a query interface such as the SPARQL endpoint of the data store. Existing approaches to learning classifiers from RDF data in such a setting fail to take advantage of the RDF Schema (RDFS) associated with the data store, which asserts subclass hierarchies that the learner could exploit. Against this background, we present a general approach that augments an existing directed graphical model with hidden variables that encode subclass hierarchies via probabilistic constraints. We also present an algorithm, ProbAVT, that adopts the variational Bayesian expectation maximization approach to learn parameters efficiently in such settings. Our experiments with several synthetic and real-world datasets show that: (i) ProbAVT matches or outperforms its counterpart that does not incorporate background knowledge in the form of subclass hierarchies; (ii) ProbAVT remains competitive with other state-of-the-art models that incorporate subclass hierarchies, and scales to large hierarchies consisting of tens of thousands of nodes.
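
To make the setting concrete, the following is a minimal sketch of how an RDFS subclass hierarchy might be retrieved and expanded before being handed to a learner. It is not ProbAVT and is not taken from the paper: the query text, the `ancestors` helper, and the `ex:` class names are illustrative assumptions. The query string is what a client could send to a SPARQL endpoint; the function works on the resulting (child, parent) pairs and makes no network calls.

```python
from collections import defaultdict

# Hypothetical query: asks a SPARQL endpoint for every direct
# rdfs:subClassOf assertion. Sending it to an endpoint is left out here.
SUBCLASS_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?child ?parent WHERE { ?child rdfs:subClassOf ?parent }
"""


def ancestors(pairs):
    """Map each class that has a superclass to the set of all its superclasses.

    `pairs` is an iterable of (child, parent) tuples, e.g. the rows
    returned for SUBCLASS_QUERY. Cycles are tolerated: a class on a
    cycle simply ends up listed among its own ancestors.
    """
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)

    result = {}
    for node in list(parents):
        seen = set()
        stack = list(parents[node])
        # Walk upward through the hierarchy, visiting each class once.
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            stack.extend(parents.get(cls, ()))
        result[node] = seen
    return result


# Illustrative (made-up) hierarchy: Dog and Cat under Mammal under Animal.
example_pairs = [
    ("ex:Dog", "ex:Mammal"),
    ("ex:Cat", "ex:Mammal"),
    ("ex:Mammal", "ex:Animal"),
]
closure = ancestors(example_pairs)
```

In this toy hierarchy, `closure["ex:Dog"]` contains both `ex:Mammal` and `ex:Animal`, which is the kind of ancestor information a hierarchy-aware model could condition on.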
