This article introduces a corpus-based method for improving the automatic morphological analysis of a non-standard text variety, namely Estonian chatroom texts. First, a morphological analyzer designed for standard written Estonian is applied to the chatroom texts. Based on an error analysis of its output, a method for improving the process is proposed. The method exploits the fact that some deviations have high token frequency but low type frequency, whereas others have low token frequency but high type frequency. The first group has to be compiled manually into a user lexicon, whereas the second group can be handled automatically, through automatic preprocessing of the texts and automatic extension of the user lexicon. As a result, the share of unknown tokens in the output of the morphological analyzer decreases from 27% to 10.5%.
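The two-way split described above can be illustrated with a minimal Python sketch. It is not the analyzer or the rule set used in the article: the function names, the frequency threshold `min_count`, the repeated-letter rule, and the toy word lists are all assumptions made for illustration only.

```python
import re
from collections import Counter


def split_unknowns(unknown_tokens, min_count=3):
    """Split unknown tokens into frequent forms (candidates for a manually
    compiled user lexicon) and rare forms (left to automatic rules).
    The threshold min_count is a hypothetical choice."""
    counts = Counter(t.lower() for t in unknown_tokens)
    frequent = {t for t, c in counts.items() if c >= min_count}
    rare = set(counts) - frequent
    return frequent, rare


def normalize(token, vocab):
    """Hypothetical preprocessing rule: shrink runs of three or more
    identical characters to one or to two characters, and return the
    first candidate found in vocab (None if neither is found)."""
    for replacement in (r"\1", r"\1\1"):
        candidate = re.sub(r"(.)\1{2,}", replacement, token)
        if candidate in vocab:
            return candidate
    return None


def unknown_rate(tokens, known, user_lexicon=frozenset(), use_rules=False):
    """Share of tokens that remain unknown after the lexicon lookup and,
    optionally, the normalization rule."""
    vocab = set(known) | set(user_lexicon)
    unknown = 0
    for tok in tokens:
        t = tok.lower()
        if t in vocab:
            continue
        if use_rules and normalize(t, vocab) is not None:
            continue
        unknown += 1
    return unknown / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    known = {"tere", "ja", "hea"}
    tokens = ["tere", "tsau", "tsau", "tsau", "tereeee", "heaaa", "ja", "xyz"]

    unknown = [t for t in tokens if t.lower() not in known]
    frequent, rare = split_unknowns(unknown)

    print("user lexicon candidates:", sorted(frequent))
    print("left to automatic rules:", sorted(rare))
    print("unknown rate before:", unknown_rate(tokens, known))
    print("unknown rate after:",
          unknown_rate(tokens, known, user_lexicon=frequent, use_rules=True))
```

In this sketch a frequent unknown form goes to the manual lexicon because a single entry covers many tokens, while rare, productive deviations such as letter repetition are cheaper to handle with a general rule than to list one by one.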