A Hybrid Approach to Part-of-Speech Tagging

Part-of-Speech (PoS) Tagging – the automatic annotation of lexical categories – is a widely used early stage of linguistic text analysis. One approach, rule-based morphological anaylsis, employs linguistic knowledge in the form of hand-coded rules to derive a set of possible analyses for each input token, but is known to produce highly ambiguous results. Stochastic tagging techniques such as Hidden Markov Models (HMMs) make use of both lexical and bigram probabilities estimated from a tagged training corpus in order to compute the most likely PoS tag sequence for each input sentence, but provide no allowance for prior linguistic knowledge. In this report, I describe the dwdst PoS tagging library, which makes use of a rule-based morphological component to extend traditional HMM techniques by the inclusion of lexical class probabilities and theoretically motivated search space reduction.

[1]  Alfred V. Aho,et al.  The Theory of Parsing, Translation, and Compiling , 1972 .

[2]  Alex Waibel,et al.  The Automatic Acquisition of Frequencies of Verb Subcategorization Frames from Tagged Corpora , 2002 .

[3]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[4]  David B. Pisoni,et al.  Text-to-speech: the mitalk system , 1987 .

[5]  Wojciech Skut,et al.  An Annotation Scheme for Free Word Order Languages , 1997, ANLP.

[6]  Kimmo Koskenniemi,et al.  A General Computational Model for Word-Form Recognition and Production , 1984 .

[7]  Kimmo Koskenniemi,et al.  A Compiler for Two-level Phonological Rules , 1987 .

[8]  Christopher D. Manning Automatic Acquisition of a Large Sub Categorization Dictionary From Corpora , 1993, ACL.

[9]  Thorsten Brants,et al.  TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[10]  Eric Brill,et al.  Some Advances in Transformation-Based Part of Speech Tagging , 1994, AAAI.

[11]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[12]  Kenneth Ward Church A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text , 1988, ANLP.

[13]  Emmanuel Roche,et al.  Finite-State Language Processing , 1997 .

[14]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[15]  Eugene Charniak,et al.  Equations for Part-of-Speech Tagging , 1993, AAAI.

[16]  Christer Samuelsson,et al.  Morphological Tagging Based Entirely on Bayesian Inference , 1993, NODALIDA.

[17]  C. Pollard,et al.  Center for the Study of Language and Information , 2022 .

[18]  Steven J. DeRose,et al.  Grammatical Category Disambiguation by Statistical Optimization , 1988, CL.

[19]  Penelope Sibun,et al.  A Practical Part-of-Speech Tagger , 1992, ANLP.

[20]  Van Nostrand,et al.  Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[21]  Eugene Charniak,et al.  Statistical language learning , 1997 .