Existing techniques for tokenisation and sentence boundary identification are extremely accurate when the data is perfectly clean (Mikheev, 2002), and have been applied successfully to corpora of news feeds and other post-edited corpora. Informal written texts are readily available, and with the growth of other informal text modalities (IRC, ICQ, SMS, etc.) they are becoming an interesting alternative, perhaps better suited as a source for lexical resources and language models for studies of dialogue and spontaneous speech. However, the high degree of spelling errors, irregularities and idiosyncrasies in the use of punctuation, white space and capitalisation requires specialised tools. In this paper we study the design and implementation of a tool for pre-processing and normalisation of noisy corpora. We argue that, rather than having separate tools for tokenisation, segmentation and spelling correction organised in a pipeline, a unified tool is appropriate because of certain specific sorts of errors. We describe how a noisy channel model can be used at the character level to perform this normalisation. We describe how the sequence of tokens needs to be divided into various types depending on their characteristics, and also how the modelling of white space needs to be conditioned on the type of the preceding and following tokens. We use trainable stochastic transducers to model typographical errors and other orthographic changes, and a variety of sequence models for white space and the different sorts of tokens. We discuss the training of the models and various efficiency issues related to the decoding algorithm, and illustrate this with examples from a 100 million word corpus of Usenet news.
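As a sketch of the formulation described above (the symbols $s$ and $t$ are illustrative labels, not taken from the paper), a character-level noisy channel model selects the normalised sequence
\[
\hat{t} = \arg\max_{t} \; P(t)\, P(s \mid t),
\]
where $s$ is the observed noisy character sequence, $P(t)$ is a sequence model over clean text, and $P(s \mid t)$ is the error model, here realised as a trainable stochastic transducer over character-level edits.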
[1] Kenneth Ward Church, et al. A Spelling Correction Program Based on a Noisy Channel Model. COLING, 1990.
[2] Janet M. Baker, et al. The Design for the Wall Street Journal-based CSR Corpus. HLT, 1992.
[3] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank. CL, 1993.
[4] Hermann Ney, et al. On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang., 1994.
[5] Frederick Jelinek, et al. Statistical methods for speech recognition. 1997.
[6] Marc Moens, et al. LT TTT - A Flexible Tokenisation Tool. LREC, 2000.
[7] Yingying Wen, et al. A compression based algorithm for Chinese word segmentation. CL, 2000.
[8] Eric Brill, et al. An Improved Error Model for Noisy Channel Spelling Correction. ACL, 2000.
[9] Jianfeng Gao, et al. A unified approach to statistical language modeling for Chinese. Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000.
[10] Shankar Kumar, et al. Normalization of non-standard words. Comput. Speech Lang., 2001.
[11] Andrei Mikheev, et al. Periods, Capitalized Words, etc. CL, 2002.