Chinese word segmentation is a fundamental problem in Chinese information processing, because segmentation results are the basis on which computers interpret natural language. Unlike most Western languages, however, Chinese has no fixed delimiter such as white space to mark word boundaries. Moreover, Chinese grammar is complex, and segmentation criteria vary with context. Chinese word segmentation is therefore a difficult task. Many algorithms have been proposed to solve it, but to the best of our knowledge, none of them outperforms all the others. In this paper, we develop a novel algorithm based on semantics and context. We propose a semantic word similarity measure that uses the concept hierarchy in knowledge graphs, and we use this measure to prune the differing results generated by several state-of-the-art Chinese word segmentation methods. The idea is to compute, for each candidate word, its concept similarity to the other words in the text, and to choose the candidate with the highest concept similarity score. To evaluate the effectiveness of the proposed approach, we conduct a series of experiments on two real datasets. The results show that our method outperforms all the state-of-the-art algorithms by filtering out wrong results and retaining correct ones.
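The selection step described above can be illustrated with a minimal sketch. It is not the method evaluated in this paper: the abstract does not give the similarity formula, so the sketch assumes a Wu-Palmer-style measure over a toy concept hierarchy and averages the similarity over the context words. The hierarchy `PARENT`, the mapping `WORD_CONCEPT`, and all function names are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy concept hierarchy (child -> parent); a real system
# would read this from a knowledge graph.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "person": "entity",
    "student": "person",
    "nature": "entity",
    "life": "nature",
    "organism": "nature",
    "evolution": "nature",
}

# Hypothetical mapping from words to concepts in the hierarchy.
WORD_CONCEPT: Dict[str, str] = {
    "研究生": "student",   # graduate student
    "生命": "life",        # life
    "生物": "organism",    # organism
    "进化": "evolution",   # evolution
}


def path_to_root(concept: str) -> List[str]:
    """Return the concept followed by all of its ancestors up to the root."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def concept_similarity(a: str, b: str) -> float:
    """Wu-Palmer-style similarity (an assumption, not the paper's formula):
    2 * depth(lowest common ancestor) / (depth(a) + depth(b)), root depth = 1."""
    path_a = path_to_root(a)
    path_b = path_to_root(b)
    ancestors_b = set(path_b)
    # The hierarchy has a single root, so a common ancestor always exists.
    lca = next(c for c in path_a if c in ancestors_b)
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def word_score(word: str, context: List[str]) -> float:
    """Average similarity of a candidate word to the known context words."""
    if word not in WORD_CONCEPT:
        return 0.0
    sims = [
        concept_similarity(WORD_CONCEPT[word], WORD_CONCEPT[c])
        for c in context
        if c in WORD_CONCEPT
    ]
    return sum(sims) / len(sims) if sims else 0.0


def pick_candidate(candidates: List[str], context: List[str]) -> str:
    """Choose the conflicting candidate with the highest similarity score."""
    return max(candidates, key=lambda w: word_score(w, context))
```

For instance, if two segmenters disagree on the string 研究生命 and propose the conflicting words 研究生 and 生命, calling `pick_candidate(["研究生", "生命"], ["生物", "进化"])` in this toy setting favors 生命, because its concept shares a deeper common ancestor with the context words.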