Mencius: A Chinese Named Entity Recognizer Using the Maximum Entropy-based Hybrid Model

This paper presents a Chinese named entity recognizer (NER): Mencius. It aims to address Chinese NER problems by combining the advantages of rule-based and machine learning (ML) based NER systems. Rule-based NER systems can explicitly encode human comprehension and can be tuned conveniently, while ML-based systems are robust, portable and inexpensive to develop. Our hybrid system incorporates a rule-based knowledge representation and template-matching tool, called InfoMap [Wu et al. 2002], into a maximum entropy (ME) framework. Named entities are represented in InfoMap as templates, which serve as ME features in Mencius. These features are edited manually, and their weights are estimated by the ME framework according to the training data. To understand how word segmentation might influence Chinese NER and the differences between a pure template-based method and our hybrid method, we configure Mencius using four distinct settings. The F-Measures of person names (PER), location names (LOC) and organization names (ORO) of the best configuration in our experiment were respectively 94.3%, 77.8% and 75.3%. From comparing the experiment results obtained using these configurations reveals that hybrid NER Systems always perform better performance in identifying person names. On the other hand, they have a little difficulty identifying location and organization names. Furthermore, using a word segmentation module improves the performance of pure Template-based NER Systems, but, it has little effect on hybrid NER systems.

[1]  David D. McDonald Internal and External Evidence in the Identification and Semantic Categorization of Proper Names , 1993 .

[2]  Nancy Chinchor,et al.  Statistical Significance of MUC-6 Results , 1995, MUC.

[3]  Richard M. Schwartz,et al.  An Algorithm that Learns What's in a Name , 1999, Machine Learning.

[4]  Ralph Grishman,et al.  NYU: Description of the MENE Named Entity System as Used in MUC-7 , 1998, MUC.

[5]  Shih-Hung Wu,et al.  FAQ-Centered Organizational Memory , 2002 .

[6]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[7]  Sam Coates-Stephens,et al.  The analysis and acquisition of proper names for robust text understanding , 1992 .

[8]  Lei Zhang,et al.  Chinese Named Entity Identification Using Class-based Language Model , 2002, COLING.

[9]  Marc Moens,et al.  Description of the LTG System Used for MUC-7 , 1998, MUC.

[10]  Ralph Grishman,et al.  A Maximum Entropy Approach to Named Entity Recognition , 1999 .

[11]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[12]  Ralph Grishman,et al.  Information Extraction: Techniques and Challenges , 1997, SCIE.

[13]  Hsin-Hsi Chen,et al.  Description of the NTU System used for MET-2 , 1998, MUC.

[14]  Maria Teresa Pazienza,et al.  Information Extraction A Multidisciplinary Approach to an Emerging Information Technology , 1997, Lecture Notes in Computer Science.

[15]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[16]  Shuanhu Bai,et al.  Description of the Kent Ridge Digital Labs System Used for MUC-7 , 1998, MUC.

[17]  Nancy Chinchor,et al.  Appendix E: MUC-7 Named Entity Task Definition (version 3.5) , 1998, MUC.

[18]  Richard M. Schwartz,et al.  BBN: Description of the SIFT System as Used for MUC-7 , 1998, MUC.