Statistical Language Modelling

Grammar-based natural language processing has reached a level where it can ‘understand’ language to a limited degree in restricted domains. For example, textual material can be parsed very accurately and semantic relations assigned to parts of sentences. An alternative approach originates in the work of Shannon over half a century ago [41], [42]: probabilities are assigned to linguistic events, with mathematical models used to represent statistical knowledge. Once such models have been built, competing linguistic events can be ranked according to their probabilities and the most likely one chosen. Although statistical methods currently use a very impoverished representation of speech and language (typically finite state), the underlying models can be trained from large amounts of data, and such statistical approaches often produce useful results. They seem especially well suited to spoken language, which is often spontaneous or conversational and not readily amenable to standard grammar-based approaches.
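As a minimal sketch of this idea, the following toy bigram model (an illustrative example, not code from any of the cited systems; the corpus and function names are invented) estimates word-transition probabilities by relative frequency and uses them to rank alternative word sequences:

```python
from collections import Counter

# Toy corpus; in practice such models are trained on large text collections.
corpus = "the cat sat on the mat the dog sat on the log".split()

# Count unigrams and bigrams (maximum-likelihood estimation from counts).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated by relative frequency."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

def sequence_prob(words):
    """Probability of a word sequence under the bigram model."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p

# The model prefers whichever sequence is more probable under the counts.
p_good = sequence_prob("the cat sat".split())
p_bad = sequence_prob("the sat cat".split())
```

Unseen bigrams receive zero probability here; in practice this sparse-data problem is handled by smoothing and discounting schemes such as those discussed in the references (e.g. Good-Turing and deleted estimation).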

[1] Thomas Hain, et al. The 1998 HTK broadcast news transcription system: development and results, 1999.

[2] Slava M. Katz, et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer, 1987, IEEE Trans. Acoust. Speech Signal Process.

[3] Giuliano Antoniol, et al. Language modelling for efficient beam-search, 1995, Comput. Speech Lang.

[4] Mari Ostendorf, et al. Robust information extraction from automatically generated speech transcriptions, 2000, Speech Commun.

[5] Steve Renals, et al. Indexing and retrieval of broadcast news, 2000, Speech Commun.

[6] L. R. Rasmussen, et al. In: Information retrieval: data structures and algorithms, 1992.

[7] John G. Proakis, et al. Probability, random variables and stochastic processes, 1985, IEEE Trans. Acoust. Speech Signal Process.

[8] Roland Kuhn, et al. Speech Recognition and the Frequency of Recently Used Words: A Modified Markov Model for Natural Language, 1988, COLING.

[9] Marc Moens, et al. Description of the LTG System Used for MUC-7, 1998, MUC.

[10] Steve Renals, et al. Information extraction from broadcast news, 2000, Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences.

[11] Maria Huhtala, et al. Random Variables and Stochastic Processes, 2021, Matrix and Tensor Decompositions in Signal Processing.

[12] M. DeGroot. Optimal Statistical Decisions, 1970.

[13] Emmanuel Roche, et al. Finite-State Language Processing, 1997.

[14] Douglas E. Appelt, et al. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, 1997, ArXiv.

[15] Stephen A. Lowe. The beta-binomial mixture model for word frequencies in documents with applications to information retrieval, 1999, EUROSPEECH.

[16] K. Sparck Jones, et al. Simple, proven approaches to text retrieval, 1994.

[17] Ronald Rosenfeld, et al. A maximum entropy approach to adaptive statistical language modelling, 1996, Comput. Speech Lang.

[18] Kenneth Ward Church, et al. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams, 1991.

[19] Clement T. Yu, et al. Effective information retrieval using term accuracy, 1977, CACM.

[20] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[21] T. Landauer, et al. Indexing by Latent Semantic Analysis, 1990.

[22] Gökhan Tür, et al. Prosody-based automatic segmentation of speech into sentences and topics, 2000, Speech Commun.

[23] Hermann Ney, et al. On the Estimation of 'Small' Probabilities by Leaving-One-Out, 1995.

[24] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing, 1974.

[25] C. J. van Rijsbergen, et al. Information Retrieval, 1979, Encyclopedia of GIS.

[26] Salim Roukos, et al. Maximum likelihood and discriminative training of direct translation models, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[27] K. Sparck Jones, et al. A Probabilistic Model of Information Retrieval: Development and Status, 1998.

[28] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing. Part I. On the Distribution of Specialty Words in a Technical Literature, 1975, J. Am. Soc. Inf. Sci.

[29] Renato De Mori, et al. A Cache-Based Natural Language Model for Speech Recognition, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[30] Frederick Jelinek, et al. Up from trigrams! The struggle for improved language models, 1991, EUROSPEECH.

[31] Stephen E. Robertson, et al. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, 1994, SIGIR '94.

[32] Anthony J. Robinson, et al. Language model adaptation using mixtures and an exponentially decaying cache, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[33] Frederick Jelinek, et al. Interpolated estimation of Markov source parameters from sparse data, 1980.

[34] Steve Renals, et al. Topic-based mixture language modelling, 1999, Nat. Lang. Eng.

[35] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[36] R. Redner, et al. Mixture densities, maximum likelihood, and the EM algorithm, 1984.

[37] Frederick Jelinek, et al. Structured language modeling, 2000, Comput. Speech Lang.

[38] Frederick Jelinek, et al. Statistical methods for speech recognition, 1997.

[39] Kenneth Ward Church, et al. Poisson mixtures, 1995, Natural Language Engineering.

[40] Lynette Hirschman, et al. MITRE: Description of the Alembic System Used for MUC-6, 1995, MUC.

[41] Ralph Weischedel, et al. Named Entity Extraction from Speech, 1998.

[42] Lynette Hirschman, et al. Overview: Information Extraction From Broadcast News, 1999.

[43] Hermann Ney, et al. On the Estimation of 'Small' Probabilities by Leaving-One-Out, 1995, IEEE Trans. Pattern Anal. Mach. Intell.

[44] Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval, 1995, SIAM Rev.

[45] Claude E. Shannon, et al. Prediction and Entropy of Printed English, 1951.

[46] Yorick Wilks, et al. Evaluation of an Algorithm for the Recognition and Classification of Proper Names, 1996, COLING.

[47] Adam L. Berger, et al. A Maximum Entropy Approach to Natural Language Processing, 1996, CL.

[48] Richard Schwartz, et al. An Algorithm that Learns What's in a Name, 1999.