Schema Free Querying of Semantic Data

Developing interfaces that let casual, non-expert users query complex structured data has been the subject of much research over the past forty years. Because such interfaces allow users to query data freely without understanding its schema, knowing how to refer to objects, or mastering the appropriate formal query language, we call them schema-free query interfaces. Schema-free query interfaces address a fundamental problem in NLP, databases, and AI: bridging the user's conceptual world and the machine's representation. However, they face three hard problems. First, we still lack a practical interface. A natural language interface (NLI) is easy for users but hard for machines: today's NLP techniques still cannot reliably extract the relational structure from natural language questions. A keyword query interface, on the other hand, has limited expressiveness and inherits the ambiguity of the natural language terms used as keywords. Second, people have many different ways to express or model the same meaning, which can result in vocabulary and structure mismatches between the user's query and the machine's representation. This is often referred to as the semantic heterogeneity problem, and today we still rely heavily on ad hoc, labor-intensive approaches to deal with it. Third, the Web has seen increasing amounts of open-domain semantic data with heterogeneous or unknown schemas, which defeats traditional NLI systems that require a well-defined schema. Some modern systems have given up translating the user query into a formal query at the schema level and instead search the entity network (ABox) directly for matches to the user query. This approach, however, is computationally expensive and tends to be ad hoc. In this thesis, we develop a novel approach to address these three hard problems.
We introduce a new schema-free query interface, which we call the SFQ interface, in which the user explicitly specifies the relational structure of the query as a graphical 'skeleton' and annotates it with freely chosen words, phrases and entity names. The SFQ interface lets us work around the unreliable step of extracting complete relations from natural language queries. We describe a framework for interpreting SFQ queries over open-domain semantic data that automatically translates them into formal queries. First, we learn a schema statistically from the entity network. The schema is itself represented as a network, which we call the schema network. Our mapping algorithms run on the schema network rather than the entity network, which makes them much more scalable. We define the probability of 'observing' a path on the schema network and, building on it, create two statistical association models that are used for disambiguation. We develop novel mapping algorithms that exploit semantic similarity measures and association measures to address the structure and vocabulary mismatch problems. Our approach is fully computational: it requires no lexicons, mapping rules, domain-specific syntactic or semantic parsers, thesauri, or hard-coded semantics. We evaluate our approach on two large datasets, DBLP+ and DBpedia. DBLP+ is a dataset we developed by augmenting the DBLP dataset with data from CiteSeerX and ArnetMiner. We created 220 SFQ queries on the DBLP+ dataset. For DBpedia, we asked three human subjects unfamiliar with DBpedia to translate 33 natural language questions from the 2011 QALD workshop into 99 SFQ queries on the DBpedia dataset. We carried out cross-validation on the 220 DBLP+ queries and cross-domain validation on the 99 DBpedia queries, in which the parameters tuned on the DBLP+ queries are applied to the DBpedia queries.
The evaluation results on both datasets show that our system is both effective and efficient.
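
To make the idea of a schema network learned from the entity network more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions and not the thesis's actual algorithm: the toy triples, the type labels, and the helper names `build_schema_network` and `edge_probability` are all hypothetical, and plain relative frequency stands in for the path probability, whose definition the abstract does not give.

```python
from collections import Counter

# Toy entity network (ABox): entity types and (subject, predicate, object) triples.
# Every name below is made up for illustration only.
TYPES = {
    "paper1": "Paper", "paper2": "Paper",
    "alice": "Author", "bob": "Author",
    "vldb": "Conference",
}
TRIPLES = [
    ("paper1", "author", "alice"),
    ("paper1", "author", "bob"),
    ("paper2", "author", "alice"),
    ("paper1", "publishedIn", "vldb"),
    ("paper2", "publishedIn", "vldb"),
]

def build_schema_network(triples, types):
    """Lift entity-level triples to type-level edges and count each edge."""
    counts = Counter()
    for subj, pred, obj in triples:
        counts[(types[subj], pred, types[obj])] += 1
    return counts

def edge_probability(counts, edge):
    """Relative frequency of one schema edge (an assumed stand-in measure)."""
    total = sum(counts.values())
    return counts[edge] / total if total else 0.0

schema = build_schema_network(TRIPLES, TYPES)
for edge, n in sorted(schema.items()):
    print(edge, n, round(edge_probability(schema, edge), 2))
```

The point of the sketch is the scalability argument made above: five entity-level triples collapse into two schema-level edges, so any mapping step that runs on the schema network touches far fewer nodes and edges than one that searches the entity network directly.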
