Word representations: A simple and general method for semi-supervised learning

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/
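The recipe the abstract describes — appending unsupervised word-representation values to a token's existing feature set — can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny embedding table and feature names are made up, and in practice the vectors would be loaded from the downloadable word features and fed to a sequence labeler.

```python
# Hypothetical 3-dimensional embedding table for illustration only;
# real tables (e.g. C&W or HLBL embeddings) cover a large vocabulary.
embeddings = {
    "bank": [0.12, -0.40, 0.33],
    "river": [0.05, 0.91, -0.22],
}
UNK = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary words


def token_features(word):
    """Baseline lexical features plus embedding dimensions as extra features."""
    feats = {
        "word.lower": word.lower(),    # typical supervised-baseline features
        "word.isupper": word.isupper(),
    }
    # Extra features from the unsupervised word representation:
    vec = embeddings.get(word.lower(), UNK)
    for i, value in enumerate(vec):
        feats[f"embed_{i}"] = value
    return feats


feats = token_features("Bank")
```

Because the representation features are just additional entries in the per-token feature dictionary, they slot into any existing feature-based NER or chunking system without changing the model.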

[1] Yoshua Bengio, et al. Neural net language models, 2008, Scholarpedia.

[2] Marie Candito, et al. Improving generative statistical parsing with semi-supervised word clustering, 2009, IWPT.

[3] Valentin I. Spitkovsky, et al. From Baby Steps to Leapfrog: How “Less is More” in Unsupervised Dependency Parsing, 2010, NAACL.

[4] Tong Zhang, et al. A High-Performance Semi-Supervised Learning Method for Text Chunking, 2005, ACL.

[5] Tong Zhang, et al. A Robust Risk Minimization based Named Entity Recognition System, 2003, CoNLL.

[6] J. Elman. Learning and development in neural networks: the importance of starting small, 1993, Cognition.

[7] Magnus Sahlgren, et al. The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces, 2006.

[8] Jaakko J. Väyrynen, et al. Comparison of Independent Component Analysis and Singular Value Decomposition in Word Context Analysis, 2005.

[9] Jason Weston, et al. Curriculum learning, 2009, ICML '09.

[10] Alexander Yates, et al. Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling, 2009, ACL.

[11] Akira Ushioda, et al. Hierarchical Clustering of Words, 1996, COLING.

[12] Geoffrey E. Hinton, et al. A Scalable Hierarchical Distributed Language Model, 2008, NIPS.

[13] Timo Honkela, et al. Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map, 1995.

[14] Petr Sojka, et al. Software Framework for Topic Modelling with Large Corpora, 2010.

[15] Xavier Carreras, et al. An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing, 2009, EMNLP.

[16] Jean-Luc Gauvain, et al. Connectionist language modeling for large vocabulary continuous speech recognition, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17] Jun Suzuki, et al. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data, 2008, ACL.

[18] Jason Weston, et al. A unified architecture for natural language processing: deep neural networks with multitask learning, 2008, ICML '08.

[19] Scott Miller, et al. Name Tagging with Word Clusters and Discriminative Training, 2004, NAACL.

[20] Magnus Sahlgren, et al. Vector-based semantic analysis: representing word meanings based on random labels, 2001.

[21] Wei Li, et al. Semi-Supervised Sequence Modeling with Syntactic Topic Models, 2005, AAAI.

[22] Patrick Pantel, et al. From Frequency to Meaning: Vector Space Models of Semantics, 2010, J. Artif. Intell. Res..

[23] Dekang Lin, et al. Phrase Clustering for Discriminative Learning, 2009, ACL.

[24] Hai Zhao, et al. Multilingual Dependency Learning: A Huge Feature Engineering Method to Semantic Dependency Parsing, 2009, CoNLL Shared Task.

[25] Yoshua Bengio, et al. Hierarchical Probabilistic Neural Network Language Model, 2005, AISTATS.

[26] Timo Honkela, et al. Towards explicit semantic features using independent component analysis, 2007.

[27] Reut Tsarfaty, et al. Enhancing Unlexicalized Parsing Performance Using a Wide Coverage Lexicon, Fuzzy Tag-Set Mapping, and EM-HMM-Based Lexical Probabilities, 2009, EACL.

[28] Marie-Francine Moens, et al. Semi-supervised Semantic Role Labeling Using the Latent Words Language Model, 2009, EMNLP.

[29] Peter W. Foltz, et al. An introduction to latent semantic analysis, 1998.

[30] T. Kohonen, et al. Self-organizing semantic maps, 1989, Biological Cybernetics.

[31] Samuel Kaski, et al. Dimensionality reduction by random mapping: fast similarity computation for clustering, 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[32] Sabine Buchholz, et al. Introduction to the CoNLL-2000 Shared Task Chunking, 2000, CoNLL/LLL.

[33] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.

[34] Xavier Carreras, et al. Simple Semi-supervised Dependency Parsing, 2008, ACL.

[35] S. T. Dumais, et al. Using latent semantic analysis to improve access to textual information, 1988, CHI '88.

[36] Magnus Sahlgren, et al. An Introduction to Random Indexing, 2005.

[37] Dan Roth, et al. Design Challenges and Misconceptions in Named Entity Recognition, 2009, CoNLL.

[38] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res..

[39] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[40] Yoshua Bengio, et al. Quick Training of Probabilistic Neural Nets by Importance Sampling, 2003, AISTATS.

[41] Geoffrey E. Hinton, et al. Three new graphical models for statistical language modelling, 2007, ICML '07.

[42] Percy Liang, et al. Semi-Supervised Learning for Natural Language, 2005.

[43] Fernando Pereira, et al. Shallow Parsing with Conditional Random Fields, 2003, NAACL.

[44] Hermann Ney, et al. Algorithms for bigram and trigram word clustering, 1995, Speech Commun..