Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores

The emergence of many interlinked, physically distributed, and autonomously maintained RDF stores offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability, because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis: access, memory, bandwidth, and computational restrictions, and sometimes privacy and confidentiality constraints, rule this out. Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical query based formulations of several representative algorithms for learning classifiers from RDF data, (ii) introduce a distributed learning framework to learn classifiers from multiple interlinked RDF stores that form a chain, (iii) identify three special cases of RDF data fragmentation and describe effective strategies for learning predictive models in each case, (iv) consider a novel application of a matrix reconstruction technique from the field of Computerized Tomography [1] to approximate the statistics needed by the learning algorithm from projections using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner, and (v) report results of experiments with a real-world social network data set (Last.fm), which demonstrate the feasibility of the proposed approach.
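
To make the idea in (iv) concrete, the following is a minimal sketch of reconstructing a count matrix from its projections, assuming the projections are simply the row sums and column sums that count queries could return. It uses a generic Kaczmarz-style algebraic reconstruction (ART) update with a nonnegativity clip; it is not claimed to be the authors' exact procedure, and the names `art_reconstruct`, `sweeps`, `relax`, and `true_counts` are illustrative assumptions. The example data are made up.

```python
import numpy as np

def art_reconstruct(row_sums, col_sums, sweeps=200, relax=1.0):
    """Estimate a nonnegative matrix from its row and column sums.

    Generic Kaczmarz-style ART sketch (an assumption, not the paper's code).
    """
    row_sums = np.asarray(row_sums, dtype=float)
    col_sums = np.asarray(col_sums, dtype=float)
    n_rows, n_cols = len(row_sums), len(col_sums)
    est = np.zeros((n_rows, n_cols))
    for _ in range(sweeps):
        # A row-sum equation involves n_cols cells with weight 1, so the
        # Kaczmarz projection spreads the residual evenly over that row.
        for i in range(n_rows):
            residual = row_sums[i] - est[i, :].sum()
            est[i, :] += relax * residual / n_cols
        # Same projection for each column-sum equation.
        for j in range(n_cols):
            residual = col_sums[j] - est[:, j].sum()
            est[:, j] += relax * residual / n_rows
        # Counts cannot be negative.
        np.clip(est, 0.0, None, out=est)
    return est

# Hypothetical joint counts held at a remote store (never transmitted).
true_counts = np.array([[3.0, 1.0],
                        [2.0, 4.0]])

# Only the projections are sent to the learner.
row_sums = true_counts.sum(axis=1)
col_sums = true_counts.sum(axis=0)

est = art_reconstruct(row_sums, col_sums)
print(est)
```

In this sketch the learner receives only `n_rows + n_cols` numbers instead of `n_rows * n_cols` cell counts. The system is underdetermined, so the estimate matches the projections but generally only approximates the true cell counts.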

[1]  Gabor T. Herman, et al. Fundamentals of Computerized Tomography: Image Reconstruction from Projections, 2009, Advances in Pattern Recognition.

[2]  Gabor T. Herman, et al. Image reconstruction from projections: the fundamentals of computerized tomography, 1980.

[3]  G. Herman, et al. Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography, 1970, Journal of theoretical biology.

[4]  David Kauchak, et al. Modeling word burstiness using the Dirichlet distribution, 2005, ICML.

[5]  Foster J. Provost, et al. Distribution-based aggregation for relational learning with identifier attributes, 2006, Machine Learning.

[6]  Gerald Reif, et al. A comparison of RDB-to-RDF mapping languages, 2011, I-Semantics '11.

[7]  T. Ferguson. A Bayesian Analysis of Some Nonparametric Problems, 1973.

[8]  Andrew McCallum, et al. A comparison of event models for naive bayes text classification, 1998, AAAI 1998.

[9]  T. Minka. Estimating a Dirichlet distribution, 2012.

[10]  Vasant Honavar, et al. Clustering remote RDF data using SPARQL update queries, 2013, 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW).

[11]  Jennifer Neville, et al. Simple estimators for relational Bayesian classifiers, 2003, Third IEEE International Conference on Data Mining.

[12]  Graham Cormode, et al. Node Classification in Social Networks, 2011, Social Network Data Analytics.

[13]  David W. Aha, et al. Transforming Graph Data for Statistical Relational Learning, 2012, J. Artif. Intell. Res.

[14]  Óscar Corcho, et al. Federating queries in SPARQL 1.1: Syntax, semantics and evaluation, 2013, J. Web Semant.

[15]  Andrew McCallum, et al. Introduction to Statistical Relational Learning, 2007.

[16]  Vasant Honavar, et al. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees, 2004, Int. J. Hybrid Intell. Syst.

[17]  Manfred Hauswirth, et al. Scalable distributed indexing and query processing over Linked Data, 2012, J. Web Semant.

[18]  Ahmed K. Elmagarmid, et al. The Kluwer international series on advances in database systems, 1996.

[19]  Vasant Honavar, et al. Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources, 2005, Discovery Science.

[20]  Katja Hose, et al. FedX: Optimization Techniques for Federated Query Processing on Linked Data, 2011, SEMWEB.

[21]  Hans-Peter Kriegel, et al. Factorizing YAGO: scalable machine learning for linked data, 2012, WWW.

[22]  Stephan Bloehdorn, et al. Graph Kernels for RDF Data, 2012, ESWC.

[23]  Tom Heath, et al. Linked Data: Evolving the Web into a Global Data Space, 2011, Linked Data.

[24]  Vasant Honavar, et al. Learning Relational Bayesian Classifiers from RDF Data, 2011, SEMWEB.

[25]  Philip S. Yu, et al. Privacy-Preserving Data Mining - Models and Algorithms, 2008, Advances in Database Systems.