Populating the Semantic Web: Combining Text and Relational Databases as RDF

The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.

Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.

The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.

Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown.
Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates. These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.
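
Two of the retrieval ideas above, querying without knowing the predicates and using path length as a measure of relatedness, can be illustrated with a minimal sketch. It is not the thesis's implementation: the triples, identifiers and function names below are invented for illustration, the graph is held as a plain in-memory list rather than in an RDF store, and path length is computed ignoring edge direction, which is a simplifying assumption.

```python
from collections import deque

# Hypothetical (subject, predicate, object) triples; all names are invented.
TRIPLES = [
    ("site:1", "hasType", "type:broch"),
    ("site:1", "locatedIn", "place:Orkney"),
    ("site:2", "locatedIn", "place:Orkney"),
    ("site:2", "hasPeriod", "period:IronAge"),
    ("doc:7", "describes", "site:2"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def path_length(triples, start, goal):
    """Fewest edges between two nodes, ignoring direction (an assumption).

    Returns None when either node is absent or no path exists.
    """
    neighbours = {}
    for s, _, o in triples:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    if start not in neighbours or goal not in neighbours:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:  # breadth-first search, so the first hit is the shortest
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Calling `match(TRIPLES, s="site:1")` returns everything known about that node without naming a single predicate, and `path_length(TRIPLES, "site:1", "site:2")` gives a rough relatedness score in which smaller values mean the two nodes are closer in the graph.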

[71]  Ralph Grishman,et al.  Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition , 1998, VLC@COLING/ACL.

[72]  Kokou Yétongnon,et al.  DB2OWL : A Tool for Automatic Database-to-Ontology Mapping , 2007, SEBD.

[73]  Lyle H. Ungar,et al.  Statistical Relational Learning for Link Prediction , 2003 .

[74]  Andy Seaborne,et al.  Three Implementations of SquishQL, a Simple RDF Query Language , 2002, SEMWEB.

[75]  Marti A. Hearst Clustering versus faceted categories for information exploration , 2006, Commun. ACM.

[76]  Paolo Atzeni,et al.  Cut and paste , 1997, PODS '97.

[77]  Andreas Harth,et al.  TRIPLE - an RDF Rule Language with Context and Use Cases , 2005, Rule Languages for Interoperability.

[78]  Christian Bizer,et al.  D2R Server - Publishing Relational Databases on the Semantic Web , 2004 .

[79]  Marieke van Erp,et al.  Cleaning and Enriching Research Data on Reptiles and Amphibians. The MITCH Pilot Project and "nulmeting" Induction of Linguistic Knowledge Research Group Technical Report ILK 06-01 , 2006 .

[80]  Balakrishnan Chandrasekaran,et al.  What are ontologies, and why do we need them? , 1999, IEEE Intell. Syst..

[81]  Simon Buckingham Shum,et al.  Knowledge Representation with Ontologies: The Present and Future , 2004, IEEE Intell. Syst..

[82]  Asunción Gómez-Pérez,et al.  R2O, an extensible and semantically based database-to-ontology mapping language , 2004 .

[83]  Nigel Shadbolt,et al.  SPARQL Query Processing with Conventional Relational Database Systems , 2005, WISE Workshops.

[84]  Alun D. Preece,et al.  Better Knowledge Management through Knowledge Engineering , 2001, IEEE Intell. Syst..

[85]  Maria T. Pazienza,et al.  Information Extraction , 2002, Lecture Notes in Computer Science.

[86]  P Palladino And the answer is ... 42. , 2000, Social history of medicine : the journal of the Society for the Social History of Medicine.

[87]  Dieter Fensel,et al.  Unifying Reasoning and Search to Web Scale , 2007, IEEE Internet Computing.

[88]  Koby Crammer,et al.  Flexible Text Segmentation with Structured Multilabel Classification , 2005, HLT.

[89]  S Williams The associative model of data , 2001 .

[90]  Beatrice Alex,et al.  Recognising Nested Named Entities in Biomedical Text , 2007, BioNLP@ACL.

[91]  Yuxin Mao,et al.  Dartgrid : a Semantic Web Toolkit for Integrating Heterogeneous Relational Databases , 2006 .

[92]  David L. Davidson,et al.  The Logical Form of Action Sentences , 2001 .

[93]  Fast semi-automatic generation of ontologies and their exploitation , 2004 .

[94]  Eero Hyvönen,et al.  CultureSampo-Finnish Culture on the Semantic Web: The Vision and First Results , 2007 .

[95]  Daniel Dominic Sleator,et al.  Parsing English with a Link Grammar , 1995, IWPT.

[96]  David Heckerman,et al.  Probabilistic Models for Relational Data , 2004 .

[97]  J. Golden,et al.  Early Bronze Age , 2002 .

[98]  F. E. A Relational Model of Data Large Shared Data Banks , 2000 .

[99]  Mark Steedman,et al.  Wide-Coverage Semantic Representations from a CCG Parser , 2004, COLING.

[100]  Richard M. Schwartz,et al.  Nymble: a High-Performance Learning Name-finder , 1997, ANLP.

[101]  Rob Malouf,et al.  Markov Models for Language-independent Named Entity Recognition , 2002, CoNLL.

[102]  Andrew McCallum,et al.  Composition of Conditional Random Fields for Transfer Learning , 2005, HLT.

[103]  Satoshi Sekine,et al.  On-Demand Information Extraction , 2006, ACL.

[104]  EVACUATION BULLETS,et al.  Atlantic City , 1926, Journal of the National Medical Association.

[105]  Eugene Inseok Chong,et al.  An Efficient SQL-based RDF Querying Scheme , 2005, VLDB.

[106]  James A. Hendler,et al.  The Semantic Web" in Scientific American , 2001 .

[107]  Simon Parsons,et al.  Principles of Data Mining by David J. Hand, Heikki Mannila and Padhraic Smyth, MIT Press, 546 pp., £34.50, ISBN 0-262-08290-X , 2004, The Knowledge Engineering Review.

[108]  Andrew Smith,et al.  Using Gazetteers in Discriminative Information Extraction , 2006, CoNLL.

[109]  Robert Meersman,et al.  Ontologies and Databases: More than a Fleeting Resemblance , 2002 .

[110]  Olga Uryupina Evaluating Name-Matching for Coreference Resolution , 2004, LREC.

[111]  John Riley,et al.  Tim Berners-Lee , 1998 .

[112]  Jian Su,et al.  Recognizing Names in Biomedical Texts: a Machine Learning Approach , 2004 .

[113]  Trevor Cohn,et al.  Logarithmic Opinion Pools for Conditional Random Fields , 2005, ACL.

[114]  Ellen M. Voorhees,et al.  Overview of TREC 2003 , 2003, TREC.

[115]  Ian Horrocks,et al.  The Semantic Web: The Roles of XML and RDF , 2000, IEEE Internet Comput..

[116]  David Milward,et al.  From Information Retrieval to Information Extraction , 2000 .

[117]  Barry Haddow,et al.  The Extraction of Enriched Protein-Protein Interactions from Biomedical Text , 2007, BioNLP@ACL.

[118]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[119]  Razvan Bunescu and Raymond J. Mooney Relational Markov Networks for Collective Information Extraction , 2004 .

[120]  David E. Millard,et al.  Automatic Ontology-Based Knowledge Extraction from Web Documents , 2003, IEEE Intell. Syst..

[121]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[122]  Eero Hyvönen,et al.  Elements of a National SemanticWeb Infrastructure--Case Study Finland on the Semantic Web , 2007, International Conference on Semantic Computing (ICSC 2007).

[123]  Xiaoyan Zhu,et al.  Discovering Patterns to Extract Protein-Protein Interactions from Full Biomedical Texts , 2004, NLPBA/BioNLP.

[124]  Maria Vargas-Vera,et al.  Event Recognition on News Stories and Semi-Automatic Population of an Ontology , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[125]  Fernando Pereira,et al.  Identifying gene and protein mentions in text using conditional random fields , 2005, BMC Bioinformatics.

[126]  Martin L. King,et al.  Towards a Methodology for Building Ontologies , 1995 .

[127]  Claire Gardent,et al.  Improving Machine Learning Approaches to Coreference Resolution , 2002, ACL.

[128]  Diego Calvanese,et al.  The Description Logic Handbook: Theory, Implementation, and Applications , 2003, Description Logic Handbook.

[129]  Aldo Gangemi,et al.  Ontology Learning and Its Application to Automated Terminology Translation , 2003, IEEE Intell. Syst..

[130]  Frank van Harmelen,et al.  A semantic web primer , 2004 .

[131]  Claire Grover,et al.  Adapting a Relation Extraction Pipeline for the BioCreAtIvE II Tasks , 2007 .

[132]  D. V. Clarke,et al.  Symbols of Power at the Time of Shonehenge , 1985 .

[133]  Claudio Gutiérrez,et al.  Querying RDF Data from a Graph Database Perspective , 2005, ESWC.

[134]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from full texts , 2004, Bioinform..

[135]  Zhisheng Huang,et al.  MultimediaN E-Culture Demonstrator , 2006, International Semantic Web Conference.

[136]  Eduardo Mena,et al.  Automatic Ontology Construction for a Multiagent-Based Software Gathering Service , 2000, CIA.

[137]  William E. Poor STAIRS: A Storage and Retrieval System Applied in Online Cataloging. , 1982 .

[138]  Elena Not,et al.  Generating Multilingual Personalized Descriptions of Museum Exhibits - The M-PIRO Project , 2001, ArXiv.

[139]  Leif Arda Nielsen,et al.  Extracting Protein-Protein interactions using simple contextual features , 2006, BioNLP@NAACL-HLT.

[140]  Dieter Fensel,et al.  SEMANTIC WEB LANGUAGES – STRENGTHS AND WEAKNESS , 2003 .

[141]  Gilad Mishne,et al.  Preprocessing documents to answer Dutch questions , 2003 .

[142]  Le Zhang,et al.  Filtering Junk Mail with a Maximum Entropy Model , 2003 .

[143]  Lynette Hirschman,et al.  Natural language question answering: the view from here , 2001, Natural Language Engineering.

[144]  Raghu Ramakrishnan,et al.  Database Management Systems , 1976 .

[145]  Sam Ruby,et al.  RESTful Web Services , 2007 .