Exploiting syntactic structure of queries in a language modeling approach to IR

Natural Language Processing (NLP) techniques have been explored to enhance the performance of Information Retrieval (IR) methods, with varied results. Most efforts to use NLP techniques have aimed to identify better index terms for representing documents. Applying NLP in the indexing phase of IR affects retrieval performance only implicitly. The explicit use of NLP techniques during the retrieval, or information seeking, phase has been restricted to interactive or dialogue systems. Recent advances in IR use Statistical Language Models (SLM) to represent documents and rank them by the likelihood that a document's model generates a given user query. This paper presents a novel method for applying NLP techniques to user queries, specifically a syntactic parse of the query, within the statistical language modeling approach to IR. In the proposed method, named Concept Language Models, a query is viewed as a sequence of concepts and a concept as a sequence of terms. The paper presents different approximations for estimating the concept and term probabilities and for computing the query likelihood estimate for documents. Empirical results on TREC test collections comparing Concept Language Models with smoothed N-gram language models are presented.
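The query-as-concepts view described above can be illustrated with a minimal sketch. It is not the paper's estimator: the abstract does not specify the approximations used, so this example assumes the simplest possible one, in which a concept's probability is the product of Jelinek-Mercer smoothed unigram probabilities of its terms. The function names, the smoothing weight `lam`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def term_prob(term, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """Jelinek-Mercer smoothed estimate of P(term | document model)."""
    p_doc = doc_counts[term] / doc_len if doc_len else 0.0
    p_coll = coll_counts[term] / coll_len if coll_len else 0.0
    return (1 - lam) * p_doc + lam * p_coll


def query_log_likelihood(concepts, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Log P(query | document) for a query given as a list of concepts.

    Assumption (not from the paper): concepts are independent of each other,
    and terms within a concept are independent given the document.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for concept in concepts:          # the query is a sequence of concepts
        for term in concept:          # a concept is a sequence of terms
            p = term_prob(term, doc_counts, doc_len, coll_counts, coll_len, lam)
            if p == 0.0:              # term unseen in the whole collection
                return float("-inf")
            score += math.log(p)
    return score


def rank(concepts, docs, lam=0.5):
    """Return document ids sorted by decreasing query log-likelihood."""
    coll_counts = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_counts.values())
    scores = {
        doc_id: query_log_likelihood(concepts, tokens, coll_counts, coll_len, lam)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection and a query already split into concepts (e.g. noun phrases
# taken from a syntactic parse; the parsing step itself is not shown here).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "speech recognition with hidden markov models".split(),
}
query_concepts = [["language", "models"], ["information", "retrieval"]]
ranking = rank(query_concepts, docs)
```

Under this independence assumption the score reduces to ordinary smoothed unigram query likelihood. The nested loop only marks where a richer concept-level estimate, such as one that models term order within a concept, would replace the inner product.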
