A Two-Stage Machine Learning Approach for Temporally-Robust Text Classification

Abstract. One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes [1]. To minimize the impact of temporal effects on ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, deriving the TWF requires running explicit and complex statistical tests, which are cumbersome and, in several cases, cannot be run at all. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time.
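To make the idea of a temporal weighting function concrete, the following is a minimal illustrative sketch, not the method proposed in the article. It assumes a hypothetical exponential-decay weight (the article learns the TWF rather than fixing its form) and plugs it into a simple nearest-centroid classifier. The names `temporal_weight`, `build_centroids`, `classify`, and the `decay` parameter are all assumptions introduced for illustration.

```python
import math
from collections import defaultdict

def temporal_weight(delta, decay=0.5):
    """Hypothetical TWF: weight shrinks exponentially with temporal distance.

    This fixed form is an assumption for illustration only; the article
    learns the TWF from data instead.
    """
    return math.exp(-decay * abs(delta))

def build_centroids(train, ref_time, decay=0.5):
    """Build one weighted centroid per class.

    train: list of (term_counts dict, label, timestamp) tuples.
    Each training document contributes in proportion to its temporal
    weight relative to ref_time (the creation time of the test document).
    """
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for terms, label, timestamp in train:
        w = temporal_weight(timestamp - ref_time, decay)
        totals[label] += w
        for term, count in terms.items():
            sums[label][term] += w * count
    return {
        label: {term: value / totals[label] for term, value in vec.items()}
        for label, vec in sums.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(doc_terms, doc_time, train, decay=0.5):
    """Assign the class whose temporally weighted centroid is most similar."""
    centroids = build_centroids(train, doc_time, decay)
    return max(centroids, key=lambda label: cosine(doc_terms, centroids[label]))

# Toy data: term "x" is associated with class A in 2000 and with class B in 2010.
train = [
    ({"x": 1}, "A", 2000),
    ({"y": 1}, "A", 2010),
    ({"y": 1}, "B", 2000),
    ({"x": 1}, "B", 2010),
]
```

In this toy setting, a test document containing only the term "x" is assigned to class B when it is dated 2010 and to class A when it is dated 2000, because training documents close in time to the test document dominate each centroid. This mirrors the third temporal effect described above, a change over time in the relationship between terms and classes.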

[1] Gisele L. Pappa, et al. Temporally-aware algorithms for document classification, 2010, SIGIR '10.

[2] Ivan Koychev, et al. Gradual Forgetting for Adaptation to Concept Drift, 2000.

[3] Yoram Singer, et al. Context-sensitive learning methods for text categorization, 1996, SIGIR '96.

[4] P. John Clarkson, et al. Web-Based Knowledge Management for Distributed Design, 2000, IEEE Intell. Syst.

[5] Alexey Tsymbal, et al. The problem of concept drift: definitions and related work, 2004.

[6] Gisele L. Pappa, et al. Automatic Document Classification Temporally Robust, 2010, J. Inf. Data Manag.

[7] Wagner Meira, et al. Understanding temporal aspects in document classification, 2008, WSDM '08.

[8] Ralf Klinkenberg, et al. Learning drifting concepts: Example selection vs. example weighting, 2004, Intell. Data Anal.

[9] Stefan Rüping, et al. Concept Drift and the Importance of Example, 2003, Text Mining.

[10] Ricard Gavaldà, et al. Kalman Filters and Adaptive Windows for Learning in Data Streams, 2006, Discovery Science.

[11] P. Royston. Tests for Departure from Normality, 1992.

[12] Gerhard Widmer, et al. Learning in the Presence of Concept Drift and Hidden Contexts, 1996, Machine Learning.

[13] Charu C. Aggarwal, et al. A Survey of Text Classification Algorithms, 2012, Mining Text Data.

[14] Ludmila I. Kuncheva, et al. On the window size for classification in changing environments, 2009, Intell. Data Anal.

[15] Ralph B. D'Agostino, et al. Tests for Departure from Normality, 1973.

[16] Wagner Meira, et al. Word co-occurrence features for text classification, 2011, Inf. Syst.

[17] Mohamed S. Kamel, et al. Pairwise optimized Rocchio algorithm for text categorization, 2011, Pattern Recognit. Lett.

[18] E. S. Pearson, et al. Tests for departure from normality. Empirical results for the distributions of b2 and √b1, 1973.

[19] Hinrich Schütze, et al. Introduction to information retrieval, 2008.

[20] George Forman, et al. Tackling concept drift by temporal inductive transfer, 2006, SIGIR.

[21] Philip S. Yu, et al. An ensemble-based approach to fast classification of multi-label data streams, 2011, 7th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom).

[22] Ralf Klinkenberg, et al. Boosting classifiers for drifting concepts, 2007, Intell. Data Anal.

[23] Lior Wolf, et al. In Defense of Word Embedding for Generic Text Representation, 2015, NLDB.

[24] Wagner Meira, et al. Temporal contexts: Effective text classification in evolving document collections, 2013, Inf. Syst.

[25] Xiaowei Yang, et al. Several SVM Ensemble Methods Integrated with Under-Sampling for Imbalanced Data Learning, 2009, ADMA.

[26] Harry Joe, et al. A remark on algorithm 643: FEXACT: an algorithm for performing Fisher's exact test in r x c contingency tables, 1993, TOMS.

[27] Marcos André Gonçalves, et al. Tackling Temporal Effects in Automatic Document Classification, 2011, J. Inf. Data Manag.

[28] Jussara M. Almeida, et al. A quantitative analysis of the temporal effects on automatic text classification, 2016, J. Assoc. Inf. Sci. Technol.

[29] Songbo Tan, et al. Neighbor-weighted K-nearest neighbor for unbalanced text corpus, 2005, Expert Syst. Appl.

[30] Rey-Long Liu, et al. Incremental context mining for adaptive document classification, 2002, KDD.

[31] Enhong Chen, et al. Exploiting probabilistic topic models to improve text categorization under class imbalance, 2011, Inf. Process. Manag.

[32] Thorsten Joachims, et al. Training linear SVMs in linear time, 2006, KDD '06.

[33] Jie Zhou, et al. Transfer estimation of evolving class priors in data stream classification, 2010, Pattern Recognit.

[34] W. Stahel, et al. Log-normal Distributions across the Sciences: Keys and Clues, 2001.

[35] Koichiro Yamauchi, et al. Learning, detecting, understanding, and predicting concept changes, 2009, 2009 International Joint Conference on Neural Networks.

[36] Koichiro Yamauchi, et al. Detecting Concept Drift Using Statistical Testing, 2007, Discovery Science.

[37] Ricard Gavaldà, et al. Learning from Time-Changing Data with Adaptive Windowing, 2007, SDM.

[38] Gisele L. Pappa, et al. Estimating the Credibility of Examples in Automatic Document Classification, 2010, J. Inf. Data Manag.

[39] Thorsten Joachims, et al. Detecting Concept Drift with Support Vector Machines, 2000, ICML.

[40] Indre Zliobaite, et al. Combining Time and Space Similarity for Small Size Learning under Concept Drift, 2009, ISMIS.

[41] Byeong Ho Kang, et al. Adaptive Web document classification with MCRDR, 2004, International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004.

[42] Marcus A. Maloof, et al. Dynamic weighted majority: a new ensemble method for tracking concept drift, 2003, Third IEEE International Conference on Data Mining.

[43] Thorsten Joachims, et al. Making large scale SVM learning practical, 1998.

[44] Jie Zhou, et al. Non-stationary data sequence classification using online class priors estimation, 2008, Pattern Recognit.

[45] Marcos André Gonçalves, et al. BROOF: Exploiting Out-of-Bag Errors, Boosting and Random Forests for Effective Automated Classification, 2015, SIGIR.

[46] L. Breiman, et al. Submodel selection and evaluation in regression. The X-random case, 1992.

[47] Anton Dries, et al. Adaptive concept drift detection, 2009, SDM.

[48] Giandomenico Spezzano, et al. An Adaptive Distributed Ensemble Approach to Mine Concept-Drifting Data Streams, 2007, 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007).

[49] João Gama, et al. Learning with Drift Detection, 2004, SBIA.

[50] C. Lee Giles, et al. Context and Page Analysis for Improved Web Search, 1998, IEEE Internet Comput.

[51] Svetha Venkatesh, et al. Using multiple windows to track concept drift, 2004, Intell. Data Anal.