Temporal Feature Modification for Retrospective Categorization

We show that the intelligent use of one small piece of contextual information, a document's publication date, can improve the performance of classifiers trained on a text categorization task. We focus on academic research documents, where the date of publication influences an author's choice of words. To exploit this contextual feature, we propose temporal feature modification, a technique that takes several sources of lexical change into account, including changes in term frequency, changes in the associative strength between terms and categories, and dynamic categorization systems. We present results of classification experiments using both full-text papers and abstracts of conference proceedings, showing improved classification accuracy across the whole collection, with performance increases of greater than 40% when temporal features are exploited. The technique is fast, classifier-independent, and works well even when only a few modifications are made.
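The abstract does not spell out how features are modified, so the following is only an illustrative sketch of one plausible reading: terms believed to drift over time are rewritten as date-tagged pseudo-terms before any classifier sees them, which keeps the step classifier-independent. The function name `modify_features`, the `drifting_terms` set, the `@` tag format, and the five-year bucket size are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def modify_features(tokens, year, drifting_terms, bucket_size=5):
    """Rewrite each drifting term as a date-tagged pseudo-term.

    Hypothetical helper: terms in `drifting_terms` get the start year of
    their publication-date bucket appended; all other terms pass through.
    """
    bucket = (year // bucket_size) * bucket_size
    return [
        f"{tok}@{bucket}" if tok in drifting_terms else tok
        for tok in tokens
    ]


# Toy documents (invented for illustration, not data from the paper).
doc_1988 = "neural network learning".split()
doc_2003 = "neural network kernel".split()

# Assumed to have been selected beforehand, e.g. by comparing how strongly
# a term is associated with a category in different periods.
drifting = {"neural"}

features_1988 = Counter(modify_features(doc_1988, 1988, drifting))
features_2003 = Counter(modify_features(doc_2003, 2003, drifting))

print(features_1988)
print(features_2003)
```

Because only the terms in `drifting_terms` are rewritten, the feature space grows by a handful of pseudo-terms rather than by a full copy of the vocabulary per period, which matches the abstract's remark that a few modifications can suffice.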
