Fusion and diversification in information retrieval

Data fusion and search result diversification are two critical research topics in information retrieval. Data fusion approaches combine search result lists in order to produce a new and hopefully better ranking. We propose two data fusion models for microblog search that exploit temporal information and infer rank scores for documents that are missing from the lists to be fused. We also propose a fusion method based on manifolds. The method constructs manifolds, lets high-ranked documents boost the relevance of low-ranked documents in the same manifold, and uses the top-k documents as anchors to improve the efficiency of data fusion. Search result diversification is widely studied as a way of tackling query ambiguity. Instead of trying to identify the "correct" interpretation behind a query, the idea is to diversify the search results so that users with different backgrounds will each find at least one result relevant. We examine the hypothesis that data fusion can improve performance in terms of diversity metrics, and propose a new data fusion method for search result diversification, called diversified data fusion. We also study the problem of personalized diversification via supervised learning, with the goal of enhancing both diversification and personalization performance. The results in this thesis show how our proposed data fusion and search result diversification methods improve retrieval performance and how they relate to each other. The insights in this thesis may be used to improve retrieval performance for a range of tasks in information retrieval.
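
As a point of reference for what "combining search result lists" means in practice, the sketch below shows CombSUM, a standard unsupervised fusion baseline, and not any of the models proposed in this thesis. The function names and the toy scores are illustrative assumptions. Each input list is min-max normalized and the normalized scores are summed per document; a document missing from a list simply contributes nothing from that list, which is the kind of simplification that inferring scores for missing documents is meant to address.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one list's scores to [0, 1]; a constant list maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several result lists (dicts of doc id -> retrieval score).

    Assumption: a document absent from a list contributes 0 from that list.
    Returns (doc, fused_score) pairs, best first, ties broken by doc id.
    """
    fused = defaultdict(float)
    for run in runs:
        if not run:  # skip empty result lists
            continue
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Two hypothetical result lists for the same query.
run_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
run_b = {"d2": 10.0, "d4": 5.0}
ranking = comb_sum([run_a, run_b])
```

Here `d2` is ranked last in one list and first in the other, so after fusion it ties with `d1` at the top, while `d4`, which appears in only one list at its lowest score, ends up last.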
