Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Summarizing data samples with quantitative measures has a long history, descriptive statistics being a case in point. However, as natural language processing methods flourish, there remain few characteristic metrics for describing a collection of texts in terms of the words, sentences, or paragraphs it comprises. In this work, we propose diversity, density, and homogeneity metrics that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds the desired properties and resonates with human intuition. Experiments on real-world datasets demonstrate that the proposed characteristic metrics correlate strongly with the text classification performance of a widely used model, BERT, which could inspire future applications.
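
To make the three notions concrete, the sketch below computes generic embedding-space proxies for dispersion (diversity), sparsity (density), and uniformity (homogeneity) over a matrix of sentence embeddings. It is a minimal illustration only: the function names, the k-nearest-neighbor distance, and the histogram-entropy formulation are assumptions chosen for clarity, not the paper's actual definitions.

    # Illustrative proxies only -- NOT the paper's exact metric definitions.
    import numpy as np

    def diversity(emb):
        """Dispersion proxy: mean pairwise Euclidean distance between embeddings."""
        dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        n = len(emb)
        return dists.sum() / (n * (n - 1))  # diagonal is zero, so it drops out

    def density(emb, k=5):
        """Sparsity proxy: inverse of the mean distance to the k nearest neighbors."""
        dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)     # exclude each point's distance to itself
        knn = np.sort(dists, axis=1)[:, :k]
        return 1.0 / knn.mean()

    def homogeneity(emb, n_bins=10):
        """Uniformity proxy: normalized entropy of distances from the centroid."""
        radii = np.linalg.norm(emb - emb.mean(axis=0), axis=1)
        hist, _ = np.histogram(radii, bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]                        # avoid log(0) for empty bins
        return -(p * np.log(p)).sum() / np.log(n_bins)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        emb = rng.normal(size=(200, 50))    # stand-in for sentence embeddings
        print(diversity(emb), density(emb), homogeneity(emb))

Under this sketch, a scattered collection scores high on diversity, a tightly packed one scores high on density, and a collection whose points spread evenly around their centroid scores high on homogeneity, matching the intuitions the abstract describes.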
