Understanding temporal aspects in document classification

Due to the increasing amount of information available on the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually follows a standard supervised learning strategy: a model is first built from preclassified documents and then used to classify new, unseen documents. One major challenge for ADC in many scenarios is that the characteristics of the documents, and of the classes to which they belong, may change over time. However, most current ADC techniques are applied without taking into account the temporal evolution of the document collection. In this work, we perform a detailed study of temporal evolution in ADC and introduce an analysis methodology. We argue that temporal evolution may be explained by three factors: 1) class distribution; 2) term distribution; and 3) class similarity. We employ metrics and experimental strategies capable of isolating each of these factors so that they can be analyzed separately, using two very different document collections: the ACM Digital Library and the Medline medical collection. Moreover, we present preliminary results on the potential gains that could be obtained by varying the training set to find the ideal size that minimizes the time effects. We show that by using just 69% of the ACM database we obtain an accuracy of 89.76%, and with only 25% of Medline an accuracy of 87.57%, which means gains of up to 20% in accuracy with much smaller training sets.
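The idea of varying the training set to limit time effects can be illustrated with a minimal sketch. It is not the experimental procedure of this work: the function names (`majority_class`, `sweep_training_window`), the toy data, and the majority-class predictor standing in for a real text classifier are all assumptions made for illustration. The sketch trains on the `w` years immediately preceding a test year and reports accuracy for each window size, so that the effect of a shift in class distribution (factor 1 above) becomes visible.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label in a non-empty list of labels."""
    return Counter(labels).most_common(1)[0][0]


def sweep_training_window(docs, test_year, max_window):
    """Score each training-window size against the documents of test_year.

    docs is a list of (year, label) pairs. For a window of size w, the
    training set is every document dated in [test_year - w, test_year - 1].
    The "classifier" is a majority-class predictor, a hypothetical stand-in
    chosen only because it reacts to class-distribution drift.
    """
    test_labels = [label for year, label in docs if year == test_year]
    results = {}
    for w in range(1, max_window + 1):
        train_labels = [
            label for year, label in docs
            if test_year - w <= year < test_year
        ]
        if not train_labels or not test_labels:
            continue  # nothing to train or evaluate on for this window
        predicted = majority_class(train_labels)
        correct = sum(label == predicted for label in test_labels)
        results[w] = correct / len(test_labels)
    return results


# Toy collection (invented): class "A" dominates early, "B" dominates later.
docs = (
    [(2000, "A")] * 8 + [(2000, "B")] * 2
    + [(2001, "A")] * 8 + [(2001, "B")] * 2
    + [(2002, "A")] * 3 + [(2002, "B")] * 7
    + [(2003, "A")] * 2 + [(2003, "B")] * 8
)

scores = sweep_training_window(docs, test_year=2003, max_window=3)
best_window = max(scores, key=scores.get)
print(scores, best_window)
```

On this toy data the one-year window scores highest, because older documents pull the predictor toward a class distribution that no longer holds. With a real classifier and real collections the best window would have to be found empirically, and it need not be the smallest one.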
