Statistical methods in language processing.

The term statistical methods here refers to a methodology that has been dominant in computational linguistics since about 1990. It is characterized by the use of stochastic models, substantial data sets, machine learning, and rigorous experimental evaluation. The shift to statistical methods in computational linguistics parallels a movement in artificial intelligence more broadly. Statistical methods have so thoroughly permeated computational linguistics that almost all work in the field draws on them in some way. There has, however, been little penetration of the methods into general linguistics. The methods themselves are largely borrowed from machine learning and information theory. We limit attention to those with direct applicability to language processing, though the methods are quite general and have many nonlinguistic applications. Not every use of statistics in language processing falls under statistical methods as we use the term. Standard hypothesis testing and experimental design, for example, are not covered in this article.

WIREs Cogn Sci 2011, 2:315-322. DOI: 10.1002/wcs.111
