Eigenwords: spectral word embeddings

Spectral learning algorithms have recently become popular in data-rich domains, driven in part by advances in large-scale randomized SVD and in spectral estimation of Hidden Markov Models. Extensions of these methods yield statistical estimation algorithms that are not only fast, scalable, and useful on real data sets, but also provably correct. Following this line of research, we propose four fast and scalable spectral algorithms for learning word embeddings: low-dimensional real vectors (called Eigenwords) that capture the "meaning" of words from their context. All the proposed algorithms harness the multi-view nature of text data, i.e., the left and right contexts of each word, are fast to train, and have strong theoretical properties. Some of the variants also have lower sample complexity and hence higher statistical power for rare words. We provide theory that establishes relationships between these algorithms and optimality criteria for the estimates they produce. We also perform a thorough qualitative and quantitative evaluation of Eigenwords, showing that these simple linear approaches give performance comparable to or better than state-of-the-art non-linear deep-learning-based methods.
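
To make the construction concrete, the sketch below shows a one-step, CCA-flavored eigenword estimate: count each word's co-occurrences with its left/right context words, apply a square-root scaling and an approximate whitening, and take a truncated randomized SVD whose scaled left singular vectors serve as the embeddings. This is a minimal illustration under stated assumptions, not the paper's exact algorithms; the vocabulary handling, window size, scaling constants, and the `eigenwords` helper are choices made for the example.

```python
# Minimal sketch (assumptions noted above) of a one-step CCA-style eigenword estimate.
import numpy as np
from collections import Counter
from sklearn.utils.extmath import randomized_svd  # fast randomized SVD

def eigenwords(corpus_tokens, vocab, k=50, window=1):
    """Return a (len(vocab), k) matrix of spectral word embeddings."""
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    # Count co-occurrences of each word with words in its left/right context window.
    counts = Counter()
    for t, w in enumerate(corpus_tokens):
        if w not in index:
            continue
        for offset in range(-window, window + 1):
            if offset == 0 or not (0 <= t + offset < len(corpus_tokens)):
                continue
            c = corpus_tokens[t + offset]
            if c in index:
                counts[(index[w], index[c])] += 1

    C = np.zeros((v, v))
    for (i, j), n in counts.items():
        C[i, j] = n

    # Square-root count scaling (an Anscombe-style variance-stabilizing choice).
    C = np.sqrt(C)

    # Whiten by row/column sums, approximating the CCA normalization.
    row = C.sum(axis=1, keepdims=True) + 1e-8
    col = C.sum(axis=0, keepdims=True) + 1e-8
    C_hat = C / np.sqrt(row) / np.sqrt(col)

    # Truncated randomized SVD; rows of U * S are the k-dimensional eigenwords.
    U, S, Vt = randomized_svd(C_hat, n_components=k)
    return U * S

# Usage (hypothetical data): embeddings = eigenwords(tokens, vocab, k=50)
```
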
