Text Categorization as a Graph Classification Problem

In this paper, we consider the task of text categorization as a graph classification problem. By representing textual documents as graph-of-words instead of historical n-gram bag-of-words, we extract more discriminative features that correspond to long-distance n-grams through frequent subgraph mining. Moreover, by capitalizing on the concept of k-core, we reduce the graph representation to its densest part – its main core – speeding up the feature extraction step for little to no cost in prediction performances. Experiments on four standard text classification datasets show statistically significant higher accuracy and macro-averaged F1-score compared to baseline approaches.

[1]  Katja Filippova,et al.  Multi-Sentence Compression: Finding Shortest Paths in Word Graphs , 2010, COLING.

[2]  Carolyn Penstein Rosé,et al.  Sentiment Classification using Automatically Extracted Subgraph Features , 2010, HLT-NAACL 2010.

[3]  Tatsuya Akutsu,et al.  Extensions of marginalized graph kernels , 2004, ICML.

[4]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[5]  Vladimir Batagelj,et al.  An O(m) Algorithm for Cores Decomposition of Networks , 2003, ArXiv.

[6]  George Karypis,et al.  Frequent Substructure-Based Approaches for Classifying Chemical Compounds , 2005, IEEE Trans. Knowl. Data Eng..

[7]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[8]  Yukio Ohsawa,et al.  KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[9]  Fabrizio Sebastiani,et al.  An analysis of the relative hardness of Reuters-21578 subsets: Research Articles , 2005 .

[10]  Mario Vento,et al.  An Improved Algorithm for Matching Large Graphs , 2001 .

[11]  Wei Wang,et al.  GAIA: graph classification using evolutionary computation , 2010, SIGMOD Conference.

[12]  G. Karypis,et al.  Frequent sub-structure-based approaches for classifying chemical compounds , 2005, Third IEEE International Conference on Data Mining.

[13]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[14]  Michalis Vazirgiannis,et al.  Graph-of-word and TW-IDF: new approach to ad hoc IR , 2013, CIKM.

[15]  Charu C. Aggarwal,et al.  A Survey of Text Classification Algorithms , 2012, Mining Text Data.

[16]  John Blitzer,et al.  Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification , 2007, ACL.

[17]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[18]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[19]  Johannes Fürnkranz,et al.  A Study Using $n$-gram Features for Text Categorization , 1998 .

[20]  Sebastian Nowozin,et al.  gBoost: a mathematical programming approach to graph classification and regression , 2009, Machine Learning.

[21]  Fabrizio Sebastiani,et al.  An Analysis of the Relative Hardness of Reuters-21578 Subsets , 2003 .

[22]  Rada Mihalcea,et al.  Random Walk Term Weighting for Improved Text Classification , 2007, Int. J. Semantic Comput..

[23]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[24]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[25]  Stephen B. Seidman,et al.  Network structure and minimum degree , 1983 .

[26]  Thomas Gärtner,et al.  On Graph Kernels: Hardness Results and Efficient Alternatives , 2003, COLT.

[27]  Levent Özgür,et al.  Text Categorization with Class-Based and Corpus-Based Keyword Selection , 2005, ISCIS.

[28]  S. V. N. Vishwanathan,et al.  Graph kernels , 2007 .

[29]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[30]  Bassiou Nikoletta,et al.  Word Clustering Using PLSA Enhanced with Long Distance Bigrams , 2010, 2010 20th International Conference on Pattern Recognition.

[31]  R. L. Thorndike Who belongs in the family? , 1953 .

[32]  Ashwin Srinivasan,et al.  The Predictive Toxicology Challenge 2000-2001 , 2001, Bioinform..

[33]  Gábor Csárdi,et al.  The igraph software package for complex network research , 2006 .

[34]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[35]  Christina Lioma,et al.  Graph-based term weighting for information retrieval , 2011, Information Retrieval.

[36]  Wei Wang,et al.  Efficient mining of frequent subgraphs in the presence of isomorphism , 2003, Third IEEE International Conference on Data Mining.

[37]  Hiroya Takamura,et al.  Sentiment Classification Using Word Sub-sequences and Dependency Sub-trees , 2005, PAKDD.

[38]  Frans Coenen,et al.  Text Classification using Graph Mining-based Feature Extraction , 2010, SGAI Conf..

[39]  Georgios Paliouras,et al.  An evaluation of Naive Bayesian anti-spam filtering , 2000, ArXiv.

[40]  Abraham Kandel,et al.  Fast Categorization of Web Documents Represented by Graphs , 2006, WEBKDD.

[41]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[42]  Hisashi Kashima,et al.  Marginalized Kernels Between Labeled Graphs , 2003, ICML.

[43]  Joost N. Kok,et al.  A quickstart in frequent structure mining can make a difference , 2004, KDD.

[44]  Yuji Matsumoto,et al.  A Boosting Algorithm for Classification of Semi-Structured Text , 2004, EMNLP.

[45]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[46]  Michalis Vazirgiannis,et al.  Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction , 2015, ECIR.

[47]  Philip S. Yu,et al.  Near-optimal Supervised Feature Selection among Frequent Subgraphs , 2009, SDM.

[48]  Ana Margarida de Jesus,et al.  Improving Methods for Single-label Text Categorization , 2007 .

[49]  Constantine Kotropoulos,et al.  Word Clustering Using PLSA Enhanced with Long Distance Bigrams , 2010, ICPR.

[50]  W. Bruce Croft,et al.  Combining classifiers in text categorization , 1996, SIGIR '96.

[51]  Yuji Matsumoto,et al.  An Application of Boosting to Graph Classification , 2004, NIPS.