Text Classification with Heterogeneous Information Network Kernels

Text classification is an important problem with many applications. Traditional approaches represent text as a bag-of-words and build classifiers on that representation. Compared with words alone, entity phrases, the relations between entities, and the types of those entities and relations carry much more information about a text. This paper presents a novel text-as-network classification framework, which introduces 1) a structured and typed heterogeneous information network (HIN) representation of texts, and 2) a meta-path based approach to link texts. We show that with this new representation and these links, the structured and typed information of entities and relations can be incorporated into kernels. In particular, we develop both a simple linear kernel and an indefinite kernel based on meta-paths in the HIN representation of texts, which we call HIN-kernels. Using Freebase, a well-known world knowledge base, to construct HINs for texts, our experiments on two benchmark datasets show that the indefinite HIN-kernel based on weighted meta-paths outperforms the state-of-the-art methods and the other HIN-kernels.
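To make the idea of linking texts through meta-paths more concrete, the following is a minimal sketch, not the paper's actual kernel. It assumes a toy HIN with three documents, three entities, and two entity types, and it counts path instances between documents along two hypothetical meta-paths (Document-Entity-Document and Document-Entity-Type-Entity-Document) by multiplying adjacency matrices. The weighted sum at the end is an illustrative linear combination; the adjacency data, the weights, and all function names are assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def commuting_matrix(adjacencies):
    """Multiply adjacency matrices along a meta-path.

    Entry (i, j) of the result counts the path instances that follow
    the meta-path from node i to node j.
    """
    result = adjacencies[0]
    for adj in adjacencies[1:]:
        result = matmul(result, adj)
    return result


def weighted_kernel(matrices, weights):
    """Combine per-meta-path count matrices with non-negative weights."""
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(n)]
        for i in range(n)
    ]


# Toy data (assumed): rows are documents, columns are entities.
doc_entity = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
# Toy data (assumed): rows are entities, columns are entity types.
entity_type = [
    [1, 0],
    [1, 0],
    [0, 1],
]

# Meta-path Document-Entity-Document: documents sharing an entity.
m_ded = commuting_matrix([doc_entity, transpose(doc_entity)])

# Meta-path Document-Entity-Type-Entity-Document: documents whose
# entities share a type.
m_deted = commuting_matrix(
    [doc_entity, entity_type, transpose(entity_type), transpose(doc_entity)]
)

# Illustrative weights; in practice these would be learned or tuned.
kernel = weighted_kernel([m_ded, m_deted], [0.7, 0.3])

for row in kernel:
    print(row)
```

Because each count matrix here has the form A·Aᵀ and the weights are non-negative, this particular combination is symmetric and positive semidefinite, so it could be passed to any kernel method that accepts a precomputed Gram matrix. The indefinite kernel mentioned in the abstract does not have this property and needs a solver designed for indefinite kernels.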
