Knowledge-driven graph similarity for text classification

The performance of automatic text classification with machine learning depends heavily on the text representation model. Structural information in text is necessary for natural language understanding, yet it is usually ignored in vector-based representations. In this paper, we present a graph kernel-based text classification framework that utilises this structural information effectively through the weighting and enrichment of a graph-based representation. We introduce weighted co-occurrence graphs to represent text documents; these graphs weight terms and their dependencies according to their relevance to text classification. We propose a novel method to automatically enrich the weighted graphs using semantic knowledge in the form of a word similarity matrix. The similarity between enriched graphs, which we call knowledge-driven graph similarity, is calculated using a graph kernel. The semantic knowledge in the enriched graphs ensures that the graph kernel goes beyond exact matching of terms and patterns to compute the semantic similarity of documents. In experiments on sentiment classification and topic classification tasks, our knowledge-driven similarity measure significantly outperforms the baseline text similarity measures on five benchmark text classification datasets.
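
To make the pipeline described above concrete, the following is a minimal sketch of its three stages: building a co-occurrence graph, enriching it with a word similarity matrix, and comparing two enriched graphs with a kernel. It is an illustration under stated assumptions, not the method of the paper: the function names, the sliding-window size, the enrichment rule S A S^T, and the normalised edge-matching kernel are all hypothetical choices made here for brevity, and the supervised weighting of terms and dependencies is omitted (edges simply carry raw co-occurrence counts).

```python
import numpy as np


def cooccurrence_graph(tokens, vocab, window=2):
    """Symmetric weighted adjacency matrix over `vocab`.

    Entry (i, j) counts how often terms i and j appear within `window`
    following tokens of each other. (Assumed construction; the abstract does
    not specify the window or the weighting scheme.)
    """
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        for other in tokens[pos + 1 : pos + 1 + window]:
            if word in index and other in index and word != other:
                i, j = index[word], index[other]
                A[i, j] += 1.0
                A[j, i] += 1.0
    return A


def enrich(A, S):
    """Spread edge weights to semantically similar terms.

    Hypothetical enrichment rule: S @ A @ S.T, where S is a word similarity
    matrix with ones on the diagonal.
    """
    return S @ A @ S.T


def graph_kernel(A1, A2):
    """Normalised edge-matching kernel (a simple stand-in graph kernel).

    This is the cosine of the two adjacency matrices viewed as vectors.
    """
    denom = np.linalg.norm(A1) * np.linalg.norm(A2)
    return float((A1 * A2).sum() / denom) if denom > 0 else 0.0


# Toy example: two documents that share no terms.
vocab = ["good", "great", "film", "movie"]
doc_a = ["good", "film"]
doc_b = ["great", "movie"]

# Illustrative similarity values, not taken from any real resource.
S = np.eye(4)
S[0, 1] = S[1, 0] = 0.8  # good ~ great
S[2, 3] = S[3, 2] = 0.9  # film ~ movie

A_a = cooccurrence_graph(doc_a, vocab)
A_b = cooccurrence_graph(doc_b, vocab)

plain = graph_kernel(A_a, A_b)
enriched = graph_kernel(enrich(A_a, S), enrich(A_b, S))
print(f"exact matching: {plain:.3f}, knowledge-driven: {enriched:.3f}")
```

In this toy case the two documents have no edge in common, so exact matching yields zero similarity, whereas the enriched graphs overlap because the similarity matrix links "good" with "great" and "film" with "movie". This is the effect the abstract describes as going beyond exact matching of terms and patterns.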
