Tagging a Corpus of Spoken Swedish

In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. Tagging spoken language is currently a significant problem for many research groups, since the availability of tagged training data from spoken language is still very limited for most languages. The overall accuracy of the tagger developed for spoken Swedish is quite respectable, ranging from 95% to 97% depending on the tagset used. We conclude that the method presented here gives good tagging accuracy with relatively little effort.
