Temporally-aware algorithms for document classification

Automatic Document Classification (ADC) remains one of the major problems in information retrieval. It usually employs a supervised learning strategy: we first build a classification model from pre-classified documents and then use this model to classify unseen documents. Most supervised algorithms assume that all documents provide equally important information. In practice, however, a document may be more or less important for building the classification model depending on several factors, such as its timeliness, the venue in which it was published, and its authors. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and with how to minimize that impact. To deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both collections follows a lognormal distribution. We then extended three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
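To make the idea of a temporally weighted classifier concrete, the following is a minimal sketch of a kNN variant whose similarity scores are scaled by a lognormal weight. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not give the TWF's parameters or how it enters each algorithm, so the function names, the parameters `mu`, `sigma`, and `shift`, the use of absolute temporal distance in years, and the multiplication of cosine similarity by the weight are all hypothetical choices.

```python
import math
from collections import defaultdict


def lognormal_twf(distance, mu=0.0, sigma=1.0, shift=1.0):
    """Hypothetical TWF: lognormal density of the shifted temporal distance.

    `shift` keeps the argument strictly positive, since the lognormal
    density is only defined for x > 0. All parameter values are assumptions.
    """
    x = abs(distance) + shift
    norm = x * sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)) / norm


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_knn(train, query_vec, query_year, k=3):
    """Classify a query with kNN, scaling each similarity by the TWF.

    `train` is a list of (term_vector, label, year) tuples.
    """
    scored = []
    for vec, label, year in train:
        weight = lognormal_twf(year - query_year)
        scored.append((cosine(vec, query_vec) * weight, label))
    # Keep the k training documents with the highest weighted similarity.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score
    return max(votes, key=votes.get)


# Two training documents with identical content but different years.
train = [
    ({"database": 1.0}, "old_label", 1990),
    ({"database": 1.0}, "new_label", 2005),
]
prediction = temporal_knn(train, {"database": 1.0}, query_year=2006, k=2)
```

With the default parameters assumed here, the weight decreases as the temporal distance grows, so training documents created close to the query's date dominate the vote even when their content is identical to that of older documents.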
