Enriching the Web of Data with topics and links

This thesis presents novel ideas and research findings for the Web of Data, a global data space spanning many so-called Linked Open Data sources. Linked Open Data adheres to a set of simple principles that allow easy access to and reuse of data published on the Web. Linked Open Data is by now an established concept, and many (mostly academic) publishers have adopted its principles, building a powerful web of structured knowledge available to everybody. So far, however, Linked Open Data does not play a significant role among the common Web technologies that currently facilitate a high-standard Web experience. In this work, we thoroughly discuss the state of the art for Linked Open Data and highlight several shortcomings, some of which we tackle in the main part of this work. First, we propose a novel type of data source meta-information, namely the topics of a dataset. This information could be published with dataset descriptions and support a variety of use cases, such as data source exploration and selection. For topic retrieval, we present an approach called Annotated Pattern Percolation (APP), which we evaluate with respect to topics extracted from Wikipedia portals. Second, we contribute to entity linking research by presenting an optimization model for joint entity linking, showing its hardness, and proposing three heuristics implemented in the LINked Data Alignment (LINDA) system. Our first solution can exploit multicore machines, whereas the second and third approaches are designed to run in a distributed shared-nothing environment. We discuss and evaluate the properties of our approaches, leading to recommendations on which algorithm to use in a specific scenario. The distributed algorithms are among the first of their kind, i.e., approaches for joint entity linking in a distributed fashion.
We also show that the entity linking problem can be tackled at very large scale, with data comprising more than 100 million entity representations from a large number of sources. Finally, we approach a sub-problem of entity linking, namely the alignment of concepts. We again target a method that looks at the data in its entirety and does not neglect existing relations. In addition, this concept alignment method must execute very fast so that it can serve as a preprocessing step for further computations. Our approach, called Holistic Concept Matching (HCM), achieves the required speed by grouping the input based on a comparison of so-called knowledge representations. Within the groups, we perform complex similarity computations, draw conclusions about relations, and detect semantic contradictions. The quality of our results is again evaluated on a large and heterogeneous dataset from the real Web. In summary, this work contributes a set of techniques for enhancing the current state of the Web of Data. All approaches have been tested on large and heterogeneous real-world input.
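The group-then-compare idea behind the concept alignment step can be illustrated with a minimal sketch. It is not the HCM implementation: the coarse grouping key, the term sets, the Jaccard measure, the threshold, and all names below are assumptions chosen for illustration only. The point is that pairwise comparison is restricted to concepts that fall into the same group, which avoids comparing every concept with every other one.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term sets (stand-in for a costlier measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_matches(concepts, threshold):
    """Group concepts by a coarse key, then compare only within each group.

    `concepts` maps a concept name to (group_key, term_set). Both the key and
    the term set are hypothetical stand-ins for a knowledge representation.
    """
    groups = defaultdict(list)
    for name, (group_key, _terms) in concepts.items():
        groups[group_key].append(name)

    matches = []
    for members in groups.values():
        # Pairwise comparison happens only inside a group.
        for left, right in combinations(sorted(members), 2):
            if jaccard(concepts[left][1], concepts[right][1]) >= threshold:
                matches.append((left, right))
    return matches


# Toy input: four concepts from two hypothetical sources.
concepts = {
    "src1:Film": ("media", {"movie", "film", "cinema"}),
    "src2:Movie": ("media", {"movie", "film"}),
    "src1:City": ("place", {"city", "town"}),
    "src2:Town": ("place", {"town", "city", "settlement"}),
}

print(candidate_matches(concepts, threshold=0.5))
```

On this toy input, only two comparisons are made (one per group) instead of the six that an all-pairs comparison would need.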
