The difficulty of processing Chinese chat text stems from the anomalous and dynamic nature of the genre: chat text uses ill-edited terms and anomalous writing styles, and these anomalies are created and discarded very quickly. One solution is to re-train the recognizer periodically, but producing a timely chat text corpus for each re-training costs a great deal of manpower. This paper proposes new approaches that detect anomalies in dynamic Chinese chat text by combining standard Chinese corpora with a chat corpus. We first model standard language text using standard Chinese corpora and apply these models to detect anomalous chat text. To improve detection quality, we then construct an anomalous chat language model from a single static chat text corpus and incorporate it into the standard language models. Our approaches calculate confidence and entropy for the input text and apply threshold values to make the decisions. Experiments show that our approaches stably achieve performance equivalent to the best results produced by existing approaches.
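The entropy-and-threshold idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the model family, the smoothing method, how confidence is computed, or the threshold values. The sketch trains an add-one smoothed character bigram model on a toy "standard" corpus and flags an input as anomalous when its average per-character entropy exceeds a threshold. The names `CharBigramModel`, `is_anomalous`, and `standard_corpus` are hypothetical, and the confidence measure and the chat language model are omitted.

```python
import math
from collections import Counter


class CharBigramModel:
    """Add-one smoothed character bigram model (illustrative assumption,
    not the models used in the paper)."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each character is a context
        self.bigram_counts = Counter()   # how often each (prev, cur) pair occurs
        vocab = set()
        for sentence in sentences:
            chars = ["<s>"] + list(sentence)
            vocab.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1
        # One extra slot stands in for all characters unseen in training.
        self.vocab_size = len(vocab) + 1

    def prob(self, prev, cur):
        """Smoothed estimate of P(cur | prev)."""
        numerator = self.bigram_counts[(prev, cur)] + 1
        denominator = self.context_counts[prev] + self.vocab_size
        return numerator / denominator

    def entropy(self, text):
        """Average negative log2 probability per character of `text`."""
        chars = ["<s>"] + list(text)
        pairs = list(zip(chars, chars[1:]))
        if not pairs:
            return 0.0
        total = sum(math.log2(self.prob(prev, cur)) for prev, cur in pairs)
        return -total / len(pairs)


def is_anomalous(model, text, threshold):
    """Flag text whose entropy under the standard model exceeds the threshold."""
    return model.entropy(text) > threshold


# Toy standard-language corpus (hypothetical data, far too small for real use).
standard_corpus = ["我喜欢你", "我喜欢看书", "你喜欢看书"]
model = CharBigramModel(standard_corpus)
```

With this toy model, a chat-style spelling such as "偶稀饭你" scores a higher entropy than the standard "我喜欢你", so a threshold placed between the two scores separates them. In practice the threshold would have to be tuned on held-out data, and a realistic system would need far larger corpora and a stronger language model.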