Text Classification using Graph Mining-based Feature Extraction

Chuntao Jiang, Frans Coenen, Robert Sanderson, and Michele Zito

The University of Liverpool, Department of Computer Science, Ashton Building, Ashton Street, Liverpool, L69 3BX, United Kingdom
e-mail: c.jiang@liv.ac.uk (Chuntao Jiang), coenen@liv.ac.uk (Frans Coenen), azaroth@liv.ac.uk (Robert Sanderson), michele@liv.ac.uk (Michele Zito)

A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives improved classification accuracy. Document sets are represented as graph sets to which a weighted graph mining algorithm is applied to extract frequent subgraphs, which are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure classification effectiveness and computational efficiency; only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some datasets. As the size of the dataset increases, further processing of the extracted frequent features becomes essential.
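The pipeline summarised in the abstract (documents encoded as graphs, frequent substructures mined, one feature vector produced per document) can be illustrated with a deliberately reduced sketch. It is not the authors' algorithm. It assumes that each document graph has words as nodes and a directed edge between adjacent words, it mines only single-edge subgraphs, and it uses plain unweighted support in place of weighted subgraph mining. The function names, the toy documents, and the `min_support` value are all hypothetical.

```python
from collections import Counter


def doc_to_edges(text):
    """Encode a document as a set of directed edges between adjacent words.

    Assumption: a word-adjacency graph; the paper may use a richer encoding.
    """
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}


def frequent_edges(edge_sets, min_support):
    """Return the edges that occur in at least `min_support` (a fraction)
    of the documents. Single edges stand in for general subgraphs here."""
    counts = Counter(edge for edges in edge_sets for edge in edges)
    threshold = min_support * len(edge_sets)
    return sorted(edge for edge, count in counts.items() if count >= threshold)


def to_feature_vectors(edge_sets, features):
    """Build one binary vector per document: 1 if the frequent edge
    is present in that document's graph, 0 otherwise."""
    return [[1 if f in edges else 0 for f in features] for edges in edge_sets]


# Hypothetical toy corpus, not data from the paper.
docs = [
    "graph mining finds frequent subgraphs",
    "graph mining extracts features",
    "text classification uses features",
]

edge_sets = [doc_to_edges(d) for d in docs]
features = frequent_edges(edge_sets, min_support=0.5)
vectors = to_feature_vectors(edge_sets, features)

print(features)  # [('graph', 'mining')]
print(vectors)   # [[1], [1], [0]]
```

The resulting vectors could then be passed to any standard classifier. A fuller treatment would grow multi-edge subgraphs and apply a weighting scheme so that only the most significant subgraphs are retained, as the abstract describes.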
