Inducing Gazetteer for Chinese Named Entity Recognition Based on Local High-Frequent Strings

Gazetteers, or entity dictionaries, are important for named entity recognition (NER). Although the dictionaries extracted automatically by the previous methods from a corpus, web or Wikipedia are very huge, they also misses some entities, especially the domain-specific entities. We present a novel method of automatic entity dictionary induction, which is able to construct a dictionary more specific to the processing text at a much lower computational cost than the previous methods. It extracts the local high-frequent strings in a document as candidate entities, and filters the invalid candidates with the accessor variety (AV) as our entity criterion. The experiments show that the obtained dictionary can effectively improve the performance of a high-precision baseline of NER.

[1]  Ellen Riloff,et al.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping , 1999, AAAI/IAAI.

[2]  Fred J. Damerau,et al.  Generating and Evaluating Domain-Oriented Multi-Word Terms from Texts , 1993, Inf. Process. Manag..

[3]  Kentaro Torisawa,et al.  Exploiting Wikipedia as External Knowledge for Named Entity Recognition , 2007, EMNLP.

[4]  Zhiyong Luo,et al.  An Integrated Method for Chinese Unknown Word Extraction , 2004, SIGHAN@ACL.

[5]  Hai Zhao,et al.  Unsupervised Segmentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Entity Recognition , 2008, IJCNLP.

[6]  Matthew Hurst,et al.  A Language Model Approach to Keyphrase Extraction , 2003, ACL 2003.

[7]  Kentaro Torisawa,et al.  Inducing Gazetteers for Named Entity Recognition by Large-Scale Clustering of Dependency Relations , 2008, ACL.

[8]  Xinnian Mao,et al.  Chinese Word Segmentation and Named Entity Recognition Based on Conditional Random Fields , 2008, IJCNLP.

[9]  Xiao Chen,et al.  The Fourth International Chinese Language Processing Bakeoff: Chinese Word Segmentation, Named Entity Recognition and Chinese POS Tagging , 2008, IJCNLP.

[10]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[11]  Christopher D. Manning,et al.  An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition , 2006, ACL.

[12]  Xiaotie Deng,et al.  Accessor Variety Criteria for Chinese Word Extraction , 2004, CL.

[13]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[14]  Bo Chen,et al.  Chinese NER Using CRFs and Logic for the Fourth SIGHAN Bakeoff , 2008, IJCNLP.

[15]  Hai Zhao,et al.  An Improved Chinese Word Segmentation System with Conditional Random Field , 2006, SIGHAN@COLING/ACL.