Linear time series models for term weighting in information retrieval

Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms. However, realistic scenarios yield additional information about terms in a collection. Of interest in this article is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize that the collection frequency of a weakly discriminative term x at time t is predictable by a linear model of the term's prior observations. On the other hand, a linear time series model for a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models. © 2010 Wiley Periodicals, Inc.
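The hypothesis can be illustrated with a minimal sketch. The abstract does not define the three time-based measures, so the code below is only an assumption about what one such measure might look like: it fits a first-order autoregressive model, x_t = a + b * x_{t-1}, to a term's collection-frequency series by ordinary least squares and uses 1 - R² as a hypothetical "poor fit" score. The function names `fit_ar1` and `poor_fit_score`, the choice of model order, and the sample series are illustrative and do not come from the article.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float, float]:
    """Fit x_t = a + b * x_{t-1} by ordinary least squares.

    Returns the intercept a, the slope b, and the R^2 of the fit.
    """
    if len(series) < 3:
        raise ValueError("need at least three observations")
    xs = series[:-1]  # lagged values x_{t-1}
    ys = series[1:]   # targets x_t
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # A constant lagged series carries no slope information; fall back to 0.
    b = sxy / sxx if sxx > 0 else 0.0
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant target series is treated as perfectly predictable.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return a, b, r2


def poor_fit_score(series: List[float]) -> float:
    """Hypothetical importance score: higher means the linear model fits worse."""
    _, _, r2 = fit_ar1(series)
    return 1.0 - r2


# Illustrative collection frequencies per time interval (made-up data).
steady_term = [100, 110, 120, 130, 140, 150, 160, 170]  # smooth growth
bursty_term = [2, 1, 3, 40, 2, 1, 38, 2]                 # sporadic bursts

print("steady:", poor_fit_score(steady_term))
print("bursty:", poor_fit_score(bursty_term))
```

Under this sketch, a term whose frequency evolves smoothly scores near zero, while a term with sporadic bursts scores higher, which matches the direction of the hypothesis: poorly predicted terms are treated as stronger discriminators.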
