On Extending NLP Techniques from the Categorical to the Latent Space: KL Divergence, Zipf's Law, and Similarity Search

Despite the recent successes of deep learning in natural language processing (NLP), there remains widespread usage of and demand for techniques that do not rely on machine learning. The advantages of these techniques are their interpretability and low cost when compared to frequently opaque and expensive machine learning models. Although they may not be as performant in all cases, they are often sufficient for common and relatively simple problems. In this paper, we aim to modernize these older methods while retaining their advantages by extending approaches from categorical or bag-of-words representations to word embedding representations in the latent space. First, we show that entropy and Kullback-Leibler divergence can be efficiently estimated using word embeddings, and we use this estimation to compare text across several categories. Next, we recast Zipf's law, the heavy-tailed distribution frequently observed in the categorical space, to the latent space. Finally, we seek to improve on the Jaccard similarity measure for sentence suggestion by introducing a new method of identifying similar sentences based on the set cover problem. We compare the performance of this algorithm against several baselines, including Word Mover's Distance and the Levenshtein distance.
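To make the divergence claim concrete, the sketch below shows one standard way such an estimate can be computed directly from embedding vectors: a k-nearest-neighbour KL divergence estimator of the Wang-Kulkarni-Verdú type applied to two samples of word embeddings, with no explicit density fitting. This is a minimal illustration under our own assumptions (the function name knn_kl_divergence, the choice of estimator, and the synthetic demo data are ours), not necessarily the exact algorithm used in the paper.

```python
# Hypothetical sketch: k-NN estimation of D(p || q) from two samples of
# word embedding vectors, X ~ p and Y ~ q, without fitting any density.
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(X, Y, k=5):
    """k-NN KL divergence estimate between samples X (n, d) and Y (m, d)."""
    n, d = X.shape
    m = Y.shape[0]
    # rho: distance from each x_i to its k-th nearest neighbour in X \ {x_i}
    # (query k+1 neighbours because the nearest one is x_i itself).
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]
    # nu: distance from each x_i to its k-th nearest neighbour in Y.
    nu = cKDTree(Y).query(X, k=k)[0]
    if k > 1:
        nu = nu[:, k - 1]
    eps = 1e-12  # guard against log(0) when duplicate vectors occur
    return d * np.mean(np.log((nu + eps) / (rho + eps))) + np.log(m / (n - 1.0))

# Toy usage: two "documents" whose embeddings are drawn from shifted Gaussians.
rng = np.random.default_rng(0)
doc_a = rng.normal(0.0, 1.0, size=(500, 50))
doc_b = rng.normal(0.5, 1.0, size=(400, 50))
print(knn_kl_divergence(doc_a, doc_b))  # larger shift -> larger estimate
```

Because the neighbour queries run on KD-trees, the estimate costs roughly O((n + m) log(n + m)) time rather than requiring an explicit density model, which is the sense in which divergence estimation in the latent space can be called efficient.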
