Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology-Based Representations

We investigate whether methods from algebraic topology are useful for text data analysis. These methods yield mathematically principled, isometry-invariant mappings from a set of vectors to a document embedding that is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate these topology-based document representations on two traditional NLP tasks: document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, they perform worse than simple techniques such as tf-idf, indicating that the geometry of a document does not provide enough variability to classify by topic or sentiment on the chosen datasets.
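To make "isometry-invariant mapping from a set of vectors" concrete, the sketch below computes the simplest persistent-homology summary of a point cloud: the death times of its 0-dimensional classes. It is an illustration only, not the representation evaluated in this work. It assumes a Vietoris-Rips filtration over Euclidean distance in which an edge enters at the distance between its endpoints, and the names `h0_death_times` and `rotate_and_shift` are hypothetical. Under that assumption every connected component is born at 0, and the finite death times are the edge lengths of a minimum spanning tree, which Kruskal's algorithm with a union-find structure recovers.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Return the finite death times of 0-dimensional persistence classes.

    Assumption: Vietoris-Rips filtration, Euclidean metric, an edge enters
    at the distance between its endpoints. The result is the ascending list
    of the n - 1 distances at which connected components merge.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest (Kruskal).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj   # two components merge at scale d
            deaths.append(d)  # so one 0-dimensional class dies at d
    return deaths


def rotate_and_shift(points, angle, shift):
    """Apply a rigid motion (an isometry) to a 2-D point cloud."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]


# Toy 2-D "document": four hypothetical vectors standing in for word embeddings.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
moved = rotate_and_shift(cloud, 0.7, (3.0, -1.0))

original = h0_death_times(cloud)
transformed = h0_death_times(moved)
print(original)
print(transformed)
```

Because the output depends only on pairwise distances, the two printed lists agree up to floating-point error: a rigid motion of the vectors leaves the signature unchanged, so the absolute position of the vectors in the embedding space plays no role in it.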
