A quantitative analysis of the temporal effects on automatic text classification

Automatic text classification (TC) remains an active research topic, and many TC algorithms have been proposed. However, most of them assume that the underlying data distribution does not change over time. In this work, we address the challenges posed by the temporal dynamics observed in textual data sets. We provide evidence of temporal effects in three textual data sets, reflected by variations over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that the temporal effects influence each analyzed data set differently and limit the performance of each considered TC algorithm to a different extent. The reported quantitative analyses, which are the original contributions of this article, provide new insights into the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the design of more accurate classification models.
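As an illustration of the first temporal effect named above, variation over time in the class distribution, the following minimal Python sketch computes the class distribution for each time period and compares two periods. The total variation distance is an assumed choice for this illustration and is not necessarily the measure used in the article. The function names and the toy data are likewise hypothetical.

```python
from collections import Counter, defaultdict


def class_distribution_by_period(docs):
    """Return {period: {label: fraction}} from (period, label) pairs."""
    counts = defaultdict(Counter)
    for period, label in docs:
        counts[period][label] += 1
    return {
        period: {label: n / sum(c.values()) for label, n in c.items()}
        for period, c in counts.items()
    }


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Assumed illustrative measure: 0 means identical class distributions,
    1 means the two periods share no classes at all.
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical toy collection: (year, class label) for each document.
docs = [
    (2000, "databases"), (2000, "databases"),
    (2000, "networks"), (2000, "networks"),
    (2001, "databases"),
    (2001, "networks"), (2001, "networks"), (2001, "networks"),
]

dist = class_distribution_by_period(docs)
drift = total_variation(dist[2000], dist[2001])
print(dist[2000], dist[2001], drift)
```

In this toy collection the share of "networks" documents grows from one half to three quarters between the two years, so a classifier that relies on class priors learned in 2000 would be miscalibrated when applied to 2001.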
