Mapping queries to the Linking Open Data cloud: A case study using DBpedia

We introduce the task of mapping search engine queries to DBpedia, a major linking hub in the Linking Open Data cloud. We propose and compare several methods for this task that combine information retrieval and machine learning techniques. Specifically, we present a supervised machine learning method for determining which concepts a user intends when issuing a query. The concepts are obtained from an ontology and may be used to provide contextual information, related concepts, or navigational suggestions to the user. Our approach first ranks candidate concepts using a language modeling framework for information retrieval. We then extract query, concept, and search-history feature vectors for these candidates. Using manual annotations, we train a machine learning algorithm to select concepts from the candidates given an input query. A simple lexical match between queries and concepts performs poorly, as does retrieval alone, i.e., omitting the concept selection stage. Our proposed method significantly improves upon these baselines, and support vector machines achieve the best performance among the machine learning algorithms evaluated.
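The first stage described above, ranking candidate concepts with a language model, can be illustrated with a minimal sketch. The abstract does not say which smoothing method or concept representation is used, so everything below is an assumption made for illustration: Dirichlet smoothing with a hypothetical `mu`, whitespace tokenization, and three toy concept descriptions standing in for DBpedia entries. The supervised concept selection stage is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank_concepts(query, concepts, mu=100.0):
    """Rank concept labels by Dirichlet-smoothed query log-likelihood.

    `concepts` maps a label to a short textual description. Both the
    description text and the value of `mu` are illustrative assumptions.
    """
    docs = {label: Counter(tokenize(text)) for label, text in concepts.items()}

    # Collection statistics, used as the background model for smoothing.
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    collection_len = sum(collection.values())

    scores = {}
    for label, counts in docs.items():
        doc_len = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            p_coll = collection.get(term, 0) / collection_len
            if p_coll == 0.0:
                # Terms unseen in the whole collection cannot discriminate
                # between concepts, so they are skipped.
                continue
            p_term = (counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
            score += math.log(p_term)
        scores[label] = score

    # Highest log-likelihood first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for concept descriptions (not real DBpedia data).
TOY_CONCEPTS = {
    "Amsterdam": "amsterdam capital city of the netherlands",
    "Rotterdam": "rotterdam port city in the netherlands",
    "Jaguar": "jaguar large cat of the americas",
}

ranking = rank_concepts("amsterdam city", TOY_CONCEPTS)
```

In the pipeline the abstract describes, a ranked list like `ranking` would only supply candidates; a trained classifier would then decide which of them to keep for the query.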
