History by Diversity: Helping Historians search News Archives

Longitudinal corpora such as newspaper archives are of immense value to historical research, and time, as an important factor for historians, strongly influences their search behavior in these archives. When searching for articles published over time, a key preference is to retrieve documents that cover the important aspects from important points in time, which differs from standard search behavior. To support this search strategy, we introduce the notion of a Historical Query Intent to explicitly model a historian's search task and define an aspect-time diversification problem over news archives. We present a novel algorithm, HistDiv, that explicitly models the aspects and important time windows based on a historian's information-seeking behavior. By incorporating temporal priors based on publication times and temporal expressions, we diversify along both the aspect and temporal dimensions. We evaluate our methods by constructing a test collection based on The New York Times Collection with a workload of 30 queries of historical intent, assessed manually. We find that HistDiv outperforms all competitors in subtopic recall with a slight loss in precision. We also present the results of a qualitative user study to determine whether this drop in precision is detrimental to user experience. Our results show that users still preferred HistDiv's ranking.
