Sequential patterns for text categorization

Text categorization is a well-known task, typically addressed with statistical approaches such as neural networks, Support Vector Machines, and other machine learning algorithms. Texts are generally treated as bags of words, with word order ignored. Although these approaches have proven effective, they do not provide users with comprehensible and reusable rules about their data. Such rules are nevertheless very important for users who need to describe trends in the data they analyze. In this framework, an association-rule-based approach, CBA, has been proposed by Bing Liu. In this paper, we propose to extend this approach by using sequential patterns, in a method called SPaC (Sequential Patterns for Classification), for text categorization. Taking order into account allows us to represent the succession of words within a document without the complex and time-consuming representations and processing required by natural language and grammatical methods. The method we propose consists in mining sequential patterns in order to build a classifier. We show experimentally that our proposal is relevant and compares favorably with other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
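
To make the idea of classifying with ordered word patterns concrete, here is a minimal sketch. It is not the SPaC algorithm itself, whose details are not given in this abstract: it enumerates short ordered word patterns by brute force, keeps those that are frequent within a class, and classifies a new text with the first matching rule. The function names (`is_subsequence`, `mine_rules`, `classify`), the support and confidence definitions, the thresholds, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def is_subsequence(pattern, words):
    """True if the items of `pattern` occur in `words` in the same order (gaps allowed)."""
    it = iter(words)
    return all(w in it for w in pattern)


def mine_rules(docs, min_support=0.5, max_len=2):
    """Brute-force sketch: docs is a list of (word_list, label) pairs.

    Returns rules (pattern, label, support, confidence), best rules first.
    Support is measured inside the class; confidence over all documents.
    """
    by_class = defaultdict(list)
    for words, label in docs:
        by_class[label].append(words)

    rules = []
    for label, texts in by_class.items():
        # Candidate patterns: ordered word tuples drawn from the class's documents.
        candidates = set()
        for words in texts:
            for n in range(1, max_len + 1):
                candidates.update(combinations(words, n))
        for pattern in candidates:
            in_class = sum(is_subsequence(pattern, w) for w in texts)
            support = in_class / len(texts)
            if support < min_support:
                continue
            matches = sum(is_subsequence(pattern, w) for w, _ in docs)
            rules.append((pattern, label, support, in_class / matches))

    # Prefer confident, frequent, then longer patterns.
    rules.sort(key=lambda r: (-r[3], -r[2], -len(r[0])))
    return rules


def classify(rules, words, default=None):
    """Return the label of the first rule whose pattern occurs in `words`."""
    for pattern, label, _, _ in rules:
        if is_subsequence(pattern, words):
            return label
    return default


# Toy corpus, invented for this example.
docs = [
    ("stock prices fall sharply".split(), "finance"),
    ("stock prices rise again".split(), "finance"),
    ("team wins final match".split(), "sport"),
    ("team loses final match".split(), "sport"),
]
rules = mine_rules(docs, min_support=1.0)
print(classify(rules, "stock prices drop".split()))
print(classify(rules, "the team plays the final".split()))
```

Unlike a bag-of-words feature, the pattern `("stock", "prices")` matches "stock market prices" but not "prices stock". The mined rules also remain readable as plain (pattern, class) pairs. A real implementation would replace the exhaustive candidate enumeration with a dedicated sequential pattern mining algorithm.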
