A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words, and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model, which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art performance.
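To make the weight-sharing idea concrete, here is a minimal sketch in PyTorch: one shared embedding-plus-convolution trunk feeds a separate linear head per task, so a gradient step on any task updates the shared parameters. The layer sizes, task names, label counts, and the single training step below are illustrative assumptions, not the paper's exact configuration (the paper also trains a language-model task on unlabeled text with a ranking criterion, which this sketch omits).

```python
# Hypothetical sketch of multitask weight-sharing: a shared trunk with
# task-specific heads. All sizes and tasks here are illustrative.
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Embedding + 1-D convolution shared by every task (the weight-sharing)."""
    def __init__(self, vocab_size=50_000, embed_dim=50, hidden_dim=100, window=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Convolution over the word positions plays the role of a sliding window.
        self.conv = nn.Conv1d(embed_dim, hidden_dim,
                              kernel_size=window, padding=window // 2)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)           # (batch, embed_dim, seq_len)
        return torch.tanh(self.conv(x)).transpose(1, 2)  # (batch, seq_len, hidden_dim)

class MultiTaskTagger(nn.Module):
    """One shared trunk, one linear output head per task."""
    def __init__(self, trunk, hidden_dim, task_sizes):
        super().__init__()
        self.trunk = trunk
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in task_sizes.items()}
        )

    def forward(self, tokens, task):
        return self.heads[task](self.trunk(tokens))      # per-token logits for `task`

# Joint training alternates between tasks; each step updates the shared trunk.
trunk = SharedTrunk()
model = MultiTaskTagger(trunk, hidden_dim=100,
                        task_sizes={"pos": 45, "chunk": 23, "ner": 9, "srl": 67})
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 50_000, (8, 30))   # dummy batch: 8 sentences, 30 tokens
labels = torch.randint(0, 45, (8, 30))       # dummy POS tags for that batch
optimizer.zero_grad()
logits = model(tokens, task="pos")
loss = loss_fn(logits.reshape(-1, 45), labels.reshape(-1))
loss.backward()                               # gradients flow into the shared trunk
optimizer.step()
```

Because every head backpropagates through the same trunk, each task acts as a regularizer for the others; this is the mechanism by which joint training improves generalization on the shared tasks.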
