How to Leverage a Multi-layered Transformer Language Model for Text Clustering: an Ensemble Approach

Pre-trained Transformer-based word embeddings are now widely used in text mining, where they are known to significantly improve supervised tasks such as text classification, named entity recognition and question answering. Since Transformer models produce several different embeddings for the same input, one at each layer of their architecture, various studies have already tried to identify which of these embeddings contribute most to the success of the above-mentioned tasks. In contrast, no such performance analysis has yet been carried out in the unsupervised setting. In this paper we evaluate the effectiveness of Transformer models on the important task of text clustering. In particular, we present a clustering ensemble approach that harnesses all of the network's layers. Numerical experiments carried out on real datasets with different Transformer models show the effectiveness of the proposed method compared to several baselines.
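
The following is a minimal sketch of the general idea described in the abstract, not the authors' exact algorithm: extract one document representation per Transformer layer, cluster each layer separately, and combine the resulting partitions into a consensus. The model name (bert-base-uncased), mean pooling, per-layer k-means, and the co-association/average-linkage consensus step are all illustrative assumptions.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def layer_embeddings(texts, model_name="bert-base-uncased"):
    """Return one document-embedding matrix per layer (mean-pooled hidden states)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states        # tuple of (n_docs, seq_len, dim), one per layer
    mask = enc["attention_mask"].unsqueeze(-1)     # exclude padding tokens from the mean pooling
    return [((h * mask).sum(1) / mask.sum(1)).numpy() for h in hidden]


def ensemble_clustering(texts, n_clusters):
    """Cluster every layer separately, then derive a consensus partition
    from the co-association matrix of the per-layer clusterings."""
    layers = layer_embeddings(texts)
    n = len(texts)
    coassoc = np.zeros((n, n))
    for X in layers:
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(layers)                         # fraction of layers agreeing on each pair
    dist = 1.0 - coassoc                           # turn agreement into a distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


if __name__ == "__main__":
    docs = ["the match ended in a draw", "stocks fell sharply today",
            "the striker scored twice", "markets rallied after the report"]
    print(ensemble_clustering(docs, n_clusters=2))
```

In this sketch the consensus step follows the standard co-association route (pairs of documents that are grouped together by many layers end up in the same final cluster); other consensus functions or per-layer clustering algorithms could be substituted without changing the overall scheme.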
