A Weibo-Oriented Method for Unknown Word Extraction

Unknown word recognition is one of the most prominent and challenging problems in Chinese language processing. Several effective approaches have been proposed, but they do not work well on messages from weibo, the Chinese counterpart of Twitter. This paper presents a method for recognizing unknown words in weibo text. Weibo messages exhibit great flexibility in wording and a high correlation between unknown words and unpredictable topics. The proposed method therefore first groups the corpus into multiple categories using K-means and derives a morpheme set from each category based on local term frequencies. Second, for each potential unknown word in every morpheme set, a newly introduced measure, named adjacency degree, is calculated to decide whether the candidate is a correct unknown word. Experiments show that the proposed method is efficient, precise, and insensitive to the size of the weibo corpus.
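The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it is hypothetical. Several details are assumptions, because the abstract does not specify them: messages are represented as bag-of-character vectors, K-means is initialized with the first k messages, candidates are character n-grams that reach a minimum local frequency within a cluster, and adjacency degree is approximated as the smaller of the numbers of distinct left and right neighboring characters.

```python
from collections import Counter, defaultdict

def sq_dist(vec, centroid):
    """Squared Euclidean distance between two sparse count vectors."""
    keys = set(vec) | set(centroid)
    return sum((vec.get(key, 0) - centroid.get(key, 0.0)) ** 2 for key in keys)

def kmeans(vectors, k, iters=10):
    """Plain K-means; assumes len(vectors) >= k and uses the first k vectors as initial centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: mean of the member vectors (empty clusters keep their centroid).
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                continue
            total = Counter()
            for v in members:
                total.update(v)
            centroids[c] = {key: n / len(members) for key, n in total.items()}
    return labels

def candidates(messages, min_freq=2, max_len=3):
    """Character n-grams (length 2..max_len) whose frequency within this cluster reaches min_freq."""
    freq = Counter()
    for m in messages:
        for n in range(2, max_len + 1):
            for i in range(len(m) - n + 1):
                freq[m[i:i + n]] += 1
    return {gram for gram, f in freq.items() if f >= min_freq}

def adjacency_degree(gram, messages):
    """ASSUMED definition: min(#distinct left neighbors, #distinct right neighbors)."""
    left, right = set(), set()
    for m in messages:
        start = m.find(gram)
        while start != -1:
            end = start + len(gram)
            left.add(m[start - 1] if start > 0 else "^")   # "^" marks message start
            right.add(m[end] if end < len(m) else "$")     # "$" marks message end
            start = m.find(gram, start + 1)
    return min(len(left), len(right))

def extract(messages, k=2, min_freq=2, min_adj=2):
    """Cluster messages, then keep per-cluster candidates with a high adjacency degree."""
    labels = kmeans([Counter(m) for m in messages], k)
    groups = defaultdict(list)
    for m, label in zip(messages, labels):
        groups[label].append(m)
    found = set()
    for group in groups.values():
        for gram in candidates(group, min_freq):
            if adjacency_degree(gram, group) >= min_adj:
                found.add(gram)
    return found
```

On the toy input `extract(["xabcy", "zabcw", "qabcr"], k=1)`, the fragments `ab` and `bc` are frequent but always have the same neighbor on one side, so only `abc` is kept. The thresholds `min_freq` and `min_adj` are illustrative defaults, not values taken from the paper.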
