Searching Locally-Defined Entities

When consuming content, users often encounter entities they are unfamiliar with, and a common need is to find information about those entities directly within the content being consumed. For example, a reader of "Adventures of Huckleberry Finn" may lose track of the character Mary Jane and want to find a paragraph in the book that gives relevant information about her. Today this is done by invoking the ubiquitous Find function ("Ctrl-F"). However, Find returns only exact matches, with no relevance ranking, which leads to a suboptimal user experience. How can we go beyond Ctrl-F? To tackle this problem, we present algorithms for semantic matching and relevance ranking that enable users to effectively search for and understand entities defined in the content they are consuming, which we call locally-defined entities. We first analyze the limitations of standard information retrieval models when applied to searching locally-defined entities, and then propose a novel semantic entity retrieval model that addresses these limitations. We also present a ranking model that leverages multiple novel signals to model the relevance of a passage. A thorough experimental evaluation of the approach in the real-world application of searching for characters within e-books shows that it outperforms the baselines by more than 60% in terms of NDCG.
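For readers unfamiliar with the evaluation metric, the following is a minimal sketch of NDCG (normalized discounted cumulative gain) over a ranked list of passages. It assumes the common exponential-gain, log2-discount formulation; the abstract does not state which NDCG variant or cutoff the evaluation uses, so the function names `dcg` and `ndcg`, the gain function, and the relevance grades below are illustrative assumptions only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of graded relevance labels in ranked order.

    Assumes the exponential-gain form: (2^g - 1) / log2(rank + 1),
    with ranks starting at 1.
    """
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    `gains` holds hypothetical relevance grades of the returned passages,
    in the order the system ranked them. Returns 0.0 when no passage
    has positive relevance.
    """
    k = len(gains) if k is None else k
    ideal = sorted(gains, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical grades for passages returned for the query "Mary Jane".
document_order = [0, 1, 0, 3, 2]  # e.g. matches listed in order of occurrence
by_relevance = [3, 2, 1, 0, 0]    # e.g. the same passages sorted by relevance
print(ndcg(document_order), ndcg(by_relevance))
```

Under this formulation, a list that places the most relevant passages first scores 1.0, while a list in plain document order, as an exact-match Find would return, is penalized for burying relevant passages lower in the ranking.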
