The Effects of Data Size and Frequency Range on Distributional Semantic Models

This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small, and that the most reliable model over data of varying sizes and frequency ranges is the inverted factorized model.

[1]  John A Bullinaria,et al.  Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD , 2012, Behavior Research Methods.

[2]  Michael N. Jones,et al.  Comparing Predictive and Co-occurrence Based Models of Lexical Semantics Trained on Child-directed Speech , 2016, CogSci.

[3]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[4]  Omer Levy,et al.  Improving Distributional Similarity with Lessons Learned from Word Embeddings , 2015, TACL.

[5]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[6]  David J. Weir,et al.  Characterising Measures of Lexical Distributional Similarity , 2004, COLING.

[7]  Silvia Bernardini,et al.  Introducing and evaluating ukWaC , a very large web-derived corpus of English , 2008 .

[8]  P. Kanerva,et al.  Permutations as a means to encode order in word space , 2008 .

[9]  Georgiana Dinu,et al.  Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors , 2014, ACL.

[10]  Thorsten Joachims,et al.  Evaluation methods for unsupervised word embeddings , 2015, EMNLP.

[11]  Magnus Sahlgren,et al.  Factorization of Latent Variables in Distributional Semantic Models , 2015, EMNLP.

[12]  Felix Hill,et al.  SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.

[13]  Elia Bruni,et al.  Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..

[14]  Omer Levy,et al.  Linguistic Regularities in Sparse and Explicit Word Representations , 2014, CoNLL.

[15]  T. Landauer,et al.  A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[16]  Jeffrey L. Elman,et al.  Finding Structure in Time , 1990, Cogn. Sci..

[17]  Fredrik Olsson,et al.  The Gavagai Living Lexicon , 2016, LREC.

[18]  Michael N. Jones,et al.  Redundancy in Perceptual and Linguistic Experience: Comparing Feature-Based and Distributional Models of Semantic Representation , 2010, Top. Cogn. Sci..

[19]  Anders Holst,et al.  Random indexing of text samples for latent semantic analysis , 2000 .

[20]  Christopher D. Manning,et al.  Better Word Representations with Recursive Neural Networks for Morphology , 2013, CoNLL.