Using Graphs and Semantic Information to Improve Text Classifiers

Text classification using semantic information has attracted growing research interest because it can represent text content more accurately than bag-of-words (BOW) approaches. Representing semantics through graphs, in turn, has several advantages over traditional feature-vector representations, and error-tolerant graph matching techniques can therefore be used for text classification. Nevertheless, very few methodologies in the literature use graph-based semantic representations. This work proposes a methodology for representing the semantic information of a summarized text as a graph. The discourse representation structure of the text is used to capture its semantic content and is then transformed into a graph. Five graph matching techniques based on Maximum Common Subgraphs (mcs) and Minimum Common Supergraphs (MCS) are evaluated with the k-NN classifier on 20 classes from the Reuters dataset, taking 10 documents of each class for both training and testing. The results indicate that the technique has the potential to perform text classification as well as traditional BOW approaches. Moreover, a majority-voting combination of the semantic representation and a traditional BOW approach improved recognition accuracy on the same dataset.
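The abstract does not say which five mcs/MCS-based measures are used, so the following is only an illustrative sketch of how one well-known mcs-based distance, d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|), can drive a k-NN classifier. It assumes that every node carries a unique label, in which case the maximum common subgraph reduces to the intersection of the node and edge sets, and it assumes that graph size is counted as nodes plus edges. The function names, the triple format, and the toy graphs are hypothetical and are not taken from the paper.

```python
from collections import Counter


def make_graph(triples):
    """Build a graph (nodes, edges) from (source, relation, target) triples."""
    edges = frozenset(triples)
    nodes = frozenset(n for s, _, t in edges for n in (s, t))
    return nodes, edges


def size(graph):
    """Graph size, assumed here to be number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs(g1, g2):
    """Maximum common subgraph under the unique-node-label assumption:
    the shared nodes together with the shared labeled edges."""
    return g1[0] & g2[0], g1[1] & g2[1]


def mcs_distance(g1, g2):
    """d = 1 - |mcs(g1, g2)| / max(|g1|, |g2|); 0 for identical graphs."""
    denom = max(size(g1), size(g2))
    if denom == 0:
        return 0.0
    return 1.0 - size(mcs(g1, g2)) / denom


def knn_predict(query, training, k=3):
    """Label a query graph by majority vote among its k nearest training graphs.
    `training` is a list of (graph, label) pairs."""
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy semantic graphs (hypothetical, not taken from the Reuters data).
g_a = make_graph([("company", "agent", "buy"), ("buy", "theme", "shares")])
g_b = make_graph([("company", "agent", "buy"), ("buy", "theme", "stock")])
g_c = make_graph([("wheat", "theme", "export"), ("country", "agent", "export")])

training = [(g_b, "acq"), (g_c, "grain")]
predicted = knn_predict(g_a, training, k=1)
```

In this toy case `g_a` shares two nodes and one edge with `g_b` and nothing with `g_c`, so the nearest neighbour of `g_a` is `g_b`. Graphs without unique node labels would need a general (and much more expensive) maximum common subgraph search instead of set intersection.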
