Query representation for cross-temporal information retrieval

This paper addresses the problem of long-term language change in information retrieval (IR) systems. IR research has often ignored lexical drift. But in the emerging domain of massive digitized book collections, the risk of vocabulary mismatch due to language change is high. Collections such as Google Books and the Hathi Trust contain text written in the vernaculars of many centuries. With respect to IR, changes in vocabulary and orthography make 14th-Century English qualitatively different from 21st-Century English. This challenges retrieval models that rely on keyword matching. With this challenge in mind, we ask: given a query written in contemporary English, how can we retrieve relevant documents that were written in early English? We argue that search in historically diverse corpora is similar to cross-language retrieval (CLIR). By considering "modern" English and "archaic" English as distinct languages, CLIR techniques can improve what we call cross-temporal IR (CTIR). We focus on ways to combine evidence to improve CTIR effectiveness, proposing and testing several ways to handle language change during book search. We find that a principled combination of three sources of evidence during relevance feedback yields strong CTIR performance.
