Natural Language Processing (Almost) from Scratch

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
