Linking Archives Using Document Enrichment and Term Selection

News, multimedia and cultural heritage archives are increasingly offering opportunities to create connections between their collections. We consider the task of linking archives: connecting an item in one archive to one or more items in other, often complementary, archives. We focus on a specific instance of the task: linking items with a rich textual representation in a news archive to items with sparse annotations in a multimedia archive, where items should be linked if they describe the same or a related event. We find that the difference in textual richness of the annotations presents a challenge, and we investigate two approaches to address it: (i) enriching sparsely annotated items with textually rich content; and (ii) reducing rich news archive items using term selection. We demonstrate the positive impact of both approaches on linking to same events and on linking to related events.
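To make approach (ii) concrete, the following is a minimal sketch of term selection, not the method used in this work. The abstract does not say how terms are selected, so the TF-IDF weighting, the smoothing, the Jaccard overlap used as a link score, and all function names (`tokenize`, `select_terms`, `jaccard`) are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_terms(doc, background, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `background` is a list of documents used to estimate document
    frequencies. TF-IDF is an assumed selection criterion.
    """
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in background:
        df.update(set(tokenize(other)))
    n = len(background)

    def score(term):
        # Add-one smoothing keeps the ratio defined for unseen terms.
        return tf[term] * math.log((n + 1) / (df[term] + 1))

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


def jaccard(a, b):
    """Set overlap, used here as a hypothetical link score."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


news = [
    "the queen visited the city today",
    "the market fell today in the city",
    "the flood hit the coast and the flood damaged homes",
]
# A sparsely annotated multimedia item (invented example data).
archive_item = "flood coast"

reduced = select_terms(news[2], news, k=3)
link_score = jaccard(reduced, tokenize(archive_item))
```

Reducing the news item to a few high-weight terms brings its representation closer in length to the sparse annotation, so that common words such as "the" no longer dominate the comparison.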
