
Text Categorization as a Graph Classification Problem

In this paper, we consider the task of text categorization as a graph classification problem. By representing textual documents as graphs-of-words instead of the traditional n-gram bag-of-words, we extract more discriminative features that correspond to long-distance n-grams through frequent subgraph mining. Moreover, by capitalizing on the concept of k-core, we reduce the graph representation to its densest part (its main core), speeding up the feature extraction step at little to no cost in prediction performance. Experiments on four standard text classification datasets show statistically significantly higher accuracy and macro-averaged F1-score compared to baseline approaches.
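To make the two representation steps in the abstract concrete, here is a minimal sketch of building a graph-of-words and reducing it to its main core. It is an illustration under stated assumptions, not the authors' implementation: the graph is undirected and unweighted, the sliding window size of 4 is an assumed default, the function names are hypothetical, and the core decomposition uses a simple quadratic peeling loop rather than an optimized linear-time algorithm. Frequent subgraph mining, the feature extraction step itself, is not shown.

```python
from collections import defaultdict


def graph_of_words(tokens, window=4):
    """Build an undirected co-occurrence graph (assumed window size).

    Each token is linked to the tokens that follow it within the
    sliding window, i.e. the next `window - 1` positions.
    """
    adj = defaultdict(set)
    for i, u in enumerate(tokens):
        adj[u]  # make sure the node exists even if it gets no edge
        for v in tokens[i + 1:i + window]:
            if u != v:
                adj[u].add(v)
                adj[v].add(u)
    return dict(adj)


def core_numbers(adj):
    """Return the core number of every node by repeated peeling.

    The node of minimum remaining degree is removed at each step;
    its core number is the largest minimum degree seen so far.
    This simple version is quadratic in the number of nodes.
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        u = min(remaining, key=degree.__getitem__)
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core


def main_core(adj):
    """Keep only the nodes with the highest core number and the edges among them."""
    core = core_numbers(adj)
    k_max = max(core.values())
    keep = {u for u, c in core.items() if c == k_max}
    return {u: adj[u] & keep for u in keep}


if __name__ == "__main__":
    # Hypothetical pre-processed document (stop words already removed).
    doc = "graph classification text categorization graph mining text features".split()
    graph = graph_of_words(doc, window=4)
    print(sorted(main_core(graph)))
```

Restricting later processing to the output of `main_core` is what shrinks the input to the mining step: low-degree peripheral terms are dropped before any subgraph pattern is enumerated.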

Michalis Vazirgiannis | François Rousseau | Emmanouil Kiagias
