Maximal frequent sequences for document classification

Document classification has attracted considerable attention from researchers because of the growing number of documents in digital form and the need to organize them. One of the most popular approaches to this problem is based on machine learning techniques [1]. However, classification results depend heavily on linguistic preprocessing and on the document representation. This dependence is more pronounced for languages in which blanks separate not only words but also the syllables that make up words, such as Vietnamese and Chinese. In this paper, we propose a language-independent classifier that relies on a flexible feature called Maximal Frequent Sequences (MFSs) [2]. In addition, we design and implement a novel algorithm to find MFSs. Our algorithm follows the MFS definition of H. Ahonen-Myka [2] and skips the expensive pruning phase. The experiments show that our classification approach achieves average F-measures of 85.16% and 89.27% on 7 classes of the widely used Reuters-21578 dataset and 5 classes of Vietnamese documents, respectively.
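
To make the notion of an MFS concrete, the following is a minimal brute-force sketch, not the algorithm proposed in the paper. It assumes the common reading of the definition: a word sequence is frequent if it occurs, in order and with gaps allowed, in at least `min_support` documents, and it is maximal if it is not a subsequence of any other frequent sequence. The names `is_subsequence`, `maximal_frequent_sequences`, `min_support`, and `max_len` are hypothetical, and the length cap `max_len` is an added assumption that keeps the enumeration small, so maximality here is relative to that cap.

```python
from itertools import combinations


def is_subsequence(seq, doc):
    """Return True if seq occurs in doc in the same order, gaps allowed."""
    it = iter(doc)
    # 'word in it' advances the iterator, so matches must appear in order.
    return all(word in it for word in seq)


def maximal_frequent_sequences(docs, min_support, max_len=4):
    """Brute-force illustration only (assumed parameters, exponential cost)."""
    # Candidates: every ordered word selection of up to max_len words
    # taken from a single document.
    candidates = set()
    for doc in docs:
        for n in range(1, max_len + 1):
            candidates.update(combinations(doc, n))

    # Frequent: supported by at least min_support documents.
    frequent = {
        c for c in candidates
        if sum(is_subsequence(c, d) for d in docs) >= min_support
    }

    # Maximal: not contained in any other frequent sequence.
    return sorted(
        s for s in frequent
        if not any(s != t and is_subsequence(s, t) for t in frequent)
    )


if __name__ == "__main__":
    docs = [
        "the stock market fell sharply".split(),
        "stock prices in the market fell".split(),
        "the market rose".split(),
    ]
    for mfs in maximal_frequent_sequences(docs, min_support=2):
        print(" ".join(mfs))
```

Because the sketch operates on whitespace-separated tokens only, it needs no language-specific word segmentation, which is the property the abstract relies on for Vietnamese text.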

[1] George Forman, et al. An Extensive Empirical Study of Feature Selection Metrics for Text Classification, 2003, J. Mach. Learn. Res.

[2] Stephen Huffman. Acquaintance: Language-Independent Document Categorization by N-Grams, 1995, TREC.

[3] Patrick Gallinari, et al. HMM-based passage models for document classification and ranking, 2001.

[4] Athanasios Kehagias, et al. A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms, 2003, Journal of Intelligent Information Systems.

[5] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.

[6] Wataru Ohyama, et al. Accuracy improvement of automatic text classification based on feature transformation, 2003, DocEng '03.

[7] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[8] Anil K. Jain, et al. Classification of text documents, 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[9] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[10] Antoine Doucet, et al. Non-Contiguous Word Sequences for Information Retrieval, 2004.

[11] Nello Cristianini, et al. Classification using String Kernels, 2000.

[12] David D. Lewis, et al. Feature Selection and Feature Extraction for Text Categorization, 1992, HLT.

[13] Helena Ahonen-Myka. Finding All Maximal Frequent Sequences in Text, 1999.

[14] George A. Miller, et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[15] Guy W. Mineau, et al. Feature Selection Strategies for Text Categorization, 2003, Canadian Conference on AI.

[16] Irene Díaz, et al. A Wrapper Approach with Support Vector Machines for Text Categorization, 2003, IWANN.

[17] Sotiris Kotsiantis, et al. Text Classification Using Machine Learning Techniques, 2005.

[18] Jörg Kindermann, et al. Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?, 2002, Machine Learning.