Proximity-based document representation for named entity retrieval

Retrieving named entities differs from retrieving documents in that the items to be retrieved (persons, locations, organizations) are described only indirectly, by documents scattered throughout the collection. Much work has been dedicated to finding references to named entities, in particular to the problems of named entity extraction and disambiguation. Just as important for retrieval performance, however, is how these snippets of text are combined to build named entity representations. We focus on the TREC expert search task, where the goal is to identify people who are knowledgeable on a specific topic. Existing language modeling techniques for expert finding assume that terms and person entities are conditionally independent given a document. We present theoretical and experimental evidence that this simplifying assumption ignores information on how named entities relate to document content. To address this issue, we propose a new document representation that emphasizes text in proximity to entities and thus incorporates the sequential information implicit in text. Our experiments demonstrate that the proposed model significantly improves retrieval performance. The main contribution of this work is an effective formal method for explicitly modeling the dependency between the named entities and the terms that appear in a document.
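To make the contrast with the conditional independence assumption concrete, the following is a minimal sketch of one way a proximity-based representation could be built. It is not the model proposed in the paper: the function name `proximity_term_weights`, the triangular kernel, the `window` parameter, and the toy sentence are all assumptions chosen for illustration.

```python
from collections import defaultdict


def proximity_term_weights(tokens, entity, window=3):
    """Return a term distribution that favors terms near mentions of `entity`.

    Hypothetical illustration: a triangular kernel over token distance,
    not the weighting scheme used in the paper.
    """
    mentions = [i for i, tok in enumerate(tokens) if tok == entity]
    weights = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok == entity or not mentions:
            continue
        # Distance to the closest mention of the entity.
        dist = min(abs(i - m) for m in mentions)
        if dist <= window:
            # 1.0 for an adjacent term, decreasing linearly with distance.
            weights[tok] += 1.0 - (dist - 1) / window
    total = sum(weights.values())
    # Normalize so the weights form a probability distribution over terms.
    return {t: w / total for t, w in weights.items()} if total else {}


# Toy document mentioning two people (assumed example data).
doc = ["alice", "studies", "retrieval", "models",
       "while", "bob", "studies", "databases"]

alice_terms = proximity_term_weights(doc, "alice")
bob_terms = proximity_term_weights(doc, "bob")
```

Under a document-level model that treats terms and entities as conditionally independent, both people in this toy document would be associated with the same term distribution. In the sketch, "databases" contributes only to the representation of "bob", because it lies outside the window around "alice".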
