Bag of textual graphs (BoTG): A general graph‐based text representation model

Text representation models are the basis for information retrieval and text mining tasks. Although many text models have been proposed, they typically target a single requirement in isolation, such as time efficiency, accuracy, or applicability to different scenarios. Here we present Bag of Textual Graphs (BoTG), a general text representation model that addresses these three requirements together. The proposed representation relies on a graph‐based scheme that encodes term proximity and term ordering, and maps text documents into a vector space that supports efficient processing while providing discriminative textual patterns. Extensive experiments are conducted in two scenarios, classification and retrieval, on multiple well‐known text collections, and we compare our model against several methods from the literature. Experimental results demonstrate that our model is generic enough to handle different tasks and collections. It is also more efficient than widely used state‐of‐the‐art methods in textual classification and retrieval tasks, with competitive effectiveness and, in some cases, gains by large margins.
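To make the idea of a graph that encodes term proximity and term ordering more concrete, the following is a minimal sketch and not the BoTG construction itself. It assumes a simple sliding-window scheme in which each term is linked by a directed edge to the terms that follow it within a fixed distance. The function name `build_term_graph` and the `window` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_term_graph(tokens, window=2):
    """Return directed edge weights as {(earlier_term, later_term): count}.

    Hypothetical helper: edge direction records term ordering, and the
    window size bounds how far apart two terms may be (term proximity).
    """
    edges = Counter()
    for i, source in enumerate(tokens):
        # Link each term to the terms that follow it within the window.
        for target in tokens[i + 1 : i + 1 + window]:
            edges[(source, target)] += 1
    return edges


# Two documents with identical bags of words but different term order.
doc_a = "graph based text representation".split()
doc_b = "text based graph representation".split()

graph_a = build_term_graph(doc_a, window=1)
graph_b = build_term_graph(doc_b, window=1)

# A plain bag-of-words view cannot tell the documents apart,
# while the directed term graphs differ.
same_bag = sorted(doc_a) == sorted(doc_b)
same_graph = graph_a == graph_b
```

In this sketch the two documents share the same vocabulary, so a bag-of-words model would treat them as identical, whereas their edge sets differ because the ordering of terms is preserved. How such graphs are then mapped into a vector space is not specified in the abstract and is left out here.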
