Paper Information - Natural Language Processing (Almost) from Scratch

Natural Language Processing (Almost) from Scratch

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
