Temporal contexts: Effective text classification in evolving document collections

Managing the huge and growing amount of information available today makes Automatic Document Classification (ADC) both crucial and very challenging. The dynamics inherent to classification problems, mainly on the Web, make the task even harder. Nevertheless, the actual impact of such temporal evolution on ADC is still poorly understood in the literature. In this context, this work evaluates, characterizes, and exploits temporal evolution to improve ADC techniques. As our first contribution, we propose a pragmatic methodology for evaluating temporal evolution in ADC domains. Through this methodology, we identify measurable factors associated with the degradation of ADC models over time. Going a step further, based on these analyses, we propose effective and efficient strategies to make current techniques more robust to natural shifts over time. We present a strategy, named temporal context selection, for selecting portions of the training set that minimize those factors. Our second contribution is a general algorithm, called Chronos, for determining such contexts. By instantiating Chronos, we are able to reduce uncertainty and improve overall classification accuracy. Empirical evaluations of heuristic instantiations of the algorithm, named WindowsChronos and FilterChronos, on two real document collections demonstrate the usefulness of our proposal. Comparison against state-of-the-art ADC algorithms shows that selecting temporal contexts improves classification accuracy by up to 10%. Finally, we highlight the applicability and generality of our proposal in practice, pointing to this study as a promising research direction.
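
To make the idea of temporal context selection concrete, the following is a minimal sketch of one simple way to restrict a training set to documents created close in time to the document being classified. It is not the Chronos algorithm or either of its instantiations, whose details are not given here; the `Doc` record, the `select_window_context` function, the fixed `half_width` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical training record: text, class label, and creation year."""
    text: str
    label: str
    year: int


def select_window_context(train: List[Doc], ref_year: int, half_width: int) -> List[Doc]:
    """Keep only training documents whose year lies within
    half_width years of the reference (test document) year.

    Assumption: a fixed symmetric window; a real method would choose
    the context from measured temporal factors, not a constant.
    """
    return [d for d in train if abs(d.year - ref_year) <= half_width]


# Toy collection (invented data, for illustration only).
train = [
    Doc("neural nets", "AI", 1995),
    Doc("expert systems", "AI", 1985),
    Doc("deep learning", "AI", 2009),
    Doc("relational query", "DB", 2008),
]

# Select the training context for a test document created in 2010.
context = select_window_context(train, ref_year=2010, half_width=2)
print([d.text for d in context])
```

A classifier would then be trained only on `context` rather than on the full training set, trading training-set size for temporal proximity to the test document.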
