Distributed Representation Prediction for Generalization to New Words

Learning distributed representations of symbols (e.g. words) has been used in several natural language processing systems. Such representations can capture semantic or syntactic similarities between words, which helps fight the curse of dimensionality when considering sequences of such words. Unfortunately, because these representations are learned only for a previously determined vocabulary of words, it is not clear how to obtain representations for new words. We present an approach that gets around this problem by treating the distributed representations as predictions from low-level or domain-knowledge features of words. We report experiments on a part-of-speech tagging task, which demonstrate the success of this approach in learning meaningful representations and in providing improved accuracy, especially for new words.
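The core idea above can be illustrated with a minimal sketch: learn a mapping from simple, hand-crafted word features (e.g. suffixes, capitalization) to the embedding space, so that a representation can be predicted for any word, including one never seen in training. The feature set, function names, and the least-squares fit here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical low-level features for a word: suffix indicators,
# capitalization, and digit presence. This feature set is an
# assumption for illustration, not the paper's feature set.
SUFFIXES = ["ing", "ed", "ly", "tion", "s"]

def word_features(word):
    """Map a word to a small fixed-length feature vector."""
    f = [1.0 if word.endswith(s) else 0.0 for s in SUFFIXES]
    f.append(1.0 if word[:1].isupper() else 0.0)
    f.append(1.0 if any(c.isdigit() for c in word) else 0.0)
    f.append(1.0)  # bias term
    return np.array(f)

def fit_embedding_predictor(words, embeddings):
    """Fit a least-squares linear map W from feature space to
    embedding space, so that embeddings[w] ~ word_features(w) @ W."""
    X = np.stack([word_features(w) for w in words])   # (n_words, d_feat)
    Y = np.stack([embeddings[w] for w in words])      # (n_words, d_emb)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)         # (d_feat, d_emb)
    return W

def predict_embedding(W, word):
    """Predict a distributed representation for any word,
    including one outside the training vocabulary."""
    return word_features(word) @ W
```

A downstream tagger can then use `predict_embedding` for out-of-vocabulary words instead of falling back to a single generic "unknown word" vector, which is where the accuracy gains on new words would come from.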
