Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

Word embeddings have been studied extensively on large text datasets. However, few studies analyze semantic representations learned from small corpora, which are particularly relevant to studies of text produced by a single person. In the present paper, we compare the capabilities of Skip-gram and LSA in this scenario, and we test both techniques for extracting relevant semantic patterns from single series of dream reports. LSA outperformed Skip-gram on two semantic tests when the training corpus was small. As a case study, we show that LSA can capture relevant word associations in series of dream reports, even when the number of dreams is small or the words are low-frequency. We propose that LSA can be used to explore word associations in dream reports, which could bring new insight into this classic research area of psychology.
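To make the LSA side of the comparison concrete, the following is a minimal sketch of how word vectors can be obtained from a very small corpus with a truncated SVD. It is not the authors' pipeline: the four toy "dream reports", the raw-count weighting, the choice of k = 2 dimensions, and the helper names `lsa_word_vectors` and `cosine` are all assumptions made for illustration.

```python
import numpy as np

# Toy "dream reports" (hypothetical, for illustration only).
docs = [
    "flying over the sea",
    "flying above the sea at night",
    "teacher gave an exam at school",
    "failed the exam at school",
]

def lsa_word_vectors(docs, k=2):
    """Build a word-by-document count matrix and reduce it with a truncated SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[index[w], j] += 1.0
    # X = U diag(s) Vt; each word is represented by its row of U_k * s_k.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return vocab, index, U[:, :k] * s[:k]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab, index, vecs = lsa_word_vectors(docs, k=2)

# Words that occur in the same reports should end up closer together
# than words that occur in different reports.
sim_same = cosine(vecs[index["flying"]], vecs[index["sea"]])
sim_diff = cosine(vecs[index["flying"]], vecs[index["exam"]])
print("flying vs sea: ", sim_same)
print("flying vs exam:", sim_diff)
```

In practice a weighting scheme such as tf-idf or log-entropy is usually applied to the count matrix before the SVD, and the number of retained dimensions becomes a parameter to tune; both are left out here to keep the sketch short.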
