As human life evolves and information spreads, new things emerge quickly and new terms are coined every day. Natural language processing systems therefore need to keep extracting new words over time. Because applications span broad areas, however, the statistical characteristics of the training domain and the testing domain may be mismatched, which inevitably degrades word extraction performance. This paper proposes a word extraction scheme that uses histogram equalization for feature normalization. With this scheme, mismatches in feature distributions caused by different corpus sizes or changes of domain can be compensated for, so that unknown word extraction becomes more reliable and applicable to novel domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. Using four combined features with equalization, it achieved F-measures of 68.43% and 71.40% for word identification on the CKIP and CUHK test sets, respectively, corresponding to IV/OOV recall rates of 66.72%/32.94% and 75.99%/58.39%. When applied to unknown word extraction in a novel domain, the scheme can identify proper nouns such as “海角七號” (Cape No. 7, the name of a film), “蠟筆小新” (Crayon Shinchan, the name of a cartoon character), and “金融海嘯” (Financial Tsunami), which cannot be extracted reliably with rule-based approaches. It appears less effective, however, at identifying names of people, places, or organizations, for which the semantic structure is prominent. The scheme is complementary to the outputs of two word segmentation systems, and is promising if other rule-based approaches can be further integrated.
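To illustrate the general idea of histogram equalization as feature normalization, the following is a minimal sketch and not the paper's implementation. It assumes that the reference distribution is the set of feature values observed in the training domain, and that each test-domain value is mapped to the reference value at the same cumulative rank. The function name `equalize` and the sample numbers are hypothetical.

```python
from bisect import bisect_right


def equalize(values, reference):
    """Map each value onto the reference distribution by matching CDF ranks.

    values:    feature values from the new (test) domain
    reference: feature values from the reference (training) domain; assumed
               here to be the target distribution, which the abstract does
               not specify
    """
    if not values or not reference:
        raise ValueError("both inputs must be non-empty")
    src = sorted(values)
    ref = sorted(reference)
    n_src, n_ref = len(src), len(ref)
    result = []
    for v in values:
        # Empirical CDF of v in its own domain is rank / n_src.
        rank = bisect_right(src, v)
        # Inverse empirical CDF of the reference: ceil(rank * n_ref / n_src) - 1,
        # computed with integers to avoid floating-point rounding.
        idx = (rank * n_ref + n_src - 1) // n_src - 1
        result.append(ref[idx])
    return result


# Hypothetical feature values on very different scales, e.g. raw counts
# from a large corpus versus a reference range seen in training.
test_scores = [10, 40, 20, 30]
train_scores = [0.0, 1.0, 2.0, 3.0]
print(equalize(test_scores, train_scores))
```

Because the mapping depends only on ranks, the relative order of candidates is preserved while their scale follows the reference domain, which is what lets a threshold or classifier tuned on one corpus be reused on a corpus of a different size.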