Linear time series models for term weighting in information retrieval

Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms. However, realistic scenarios yield additional information about terms in a collection. Of interest in this article is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize that the collection frequency of a weakly discriminative term x at time t is predictable by a linear model of the term's prior observations. Conversely, a linear time series model of a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models. © 2010 Wiley Periodicals, Inc.
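The hypothesis can be illustrated with a minimal sketch, which is not the article's implementation. It assumes an autoregressive model fitted by ordinary least squares and uses 1 − R² as a stand-in importance score. The abstract does not define the three time-based measures, so the function names `ar_fit_r2` and `temporal_importance`, the model order, and the toy series are all hypothetical.

```python
import numpy as np


def ar_fit_r2(series, order=2):
    """Fit an AR(order) model with intercept by least squares; return in-sample R^2."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    if n <= 2 * order + 1:
        raise ValueError("series too short for the requested order")
    # Column k holds the observation (k + 1) steps before each target value.
    lags = [y[order - k - 1 : n - k - 1] for k in range(order)]
    X = np.column_stack([np.ones(n - order)] + lags)
    target = y[order:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0


def temporal_importance(series, order=2):
    """Hypothetical score (an assumption): a poor linear fit means higher importance."""
    return 1.0 - ar_fit_r2(series, order)


# Toy collection-frequency series over 60 time intervals (fabricated for illustration).
t = np.arange(60)
steady_term = 100.0 + 2.0 * t            # grows predictably with the collection
rng = np.random.default_rng(0)
erratic_term = rng.poisson(5.0, size=60)  # no linear dependence on its own past

steady_score = temporal_importance(steady_term)
erratic_score = temporal_importance(erratic_term)
print(f"steady term:  {steady_score:.3f}")
print(f"erratic term: {erratic_score:.3f}")
```

Under these assumptions the steadily growing term is almost perfectly predicted from its own past and scores near zero, while the erratic term is poorly predicted and scores higher, mirroring the contrast between weak and strong discriminators described above.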
