A Class-based Language Model Approach to Chinese Named Entity Identification

This paper presents a method for Chinese named entity (NE) identification using a class-based language model (LM). The method targets three types of NEs: personal names (PERs), location names (LOCs), and organization names (ORGs). Each type of NE is defined as a class. Our language model consists of two sub-models: (1) a set of entity models, each of which estimates the generative probability of a Chinese character string given an NE class; and (2) a contextual model, which estimates the generative probability of a class sequence. The class-based LM thus provides a statistical framework that handles Chinese word segmentation and NE identification in a unified way. This paper also describes methods for identifying nested NEs and NE abbreviations. Evaluation on a broad-coverage test set shows that the proposed model achieves performance comparable to that of state-of-the-art Chinese NE identification systems.
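The interplay of the two sub-models can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a bigram contextual model, collapses all ordinary words into one hypothetical `WORD` class, and uses a Viterbi-style search over segmentations. The names `ENTITY`, `CONTEXT`, and `decode`, and every probability value, are invented for illustration.

```python
import math

# Toy entity models: P(segment | class). All values are made up.
ENTITY = {
    "PER": {"李明": 0.5},
    "LOC": {"北京": 0.6},
    "WORD": {"在": 0.3, "工作": 0.2, "李": 0.01, "明": 0.01, "北": 0.01, "京": 0.01},
}

# Toy contextual model: P(class | previous class), assumed to be a bigram.
CONTEXT = {
    ("<s>", "PER"): 0.3, ("<s>", "WORD"): 0.6, ("<s>", "LOC"): 0.1,
    ("PER", "WORD"): 0.8, ("LOC", "WORD"): 0.8,
    ("WORD", "WORD"): 0.5, ("WORD", "LOC"): 0.3, ("WORD", "PER"): 0.2,
}


def decode(text, max_len=4):
    """Return the most probable list of (segment, class) pairs, or None."""
    # best[i][cls] = (log probability, (segment start, previous class))
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            seg = text[j:i]
            for cls, model in ENTITY.items():
                if seg not in model:
                    continue
                for prev, (score, _) in best[j].items():
                    trans = CONTEXT.get((prev, cls))
                    if trans is None:
                        continue
                    # Contextual model term plus entity model term.
                    cand = score + math.log(trans) + math.log(model[seg])
                    if cls not in best[i] or cand > best[i][cls][0]:
                        best[i][cls] = (cand, (j, prev))
    if not best[-1]:
        return None  # no segmentation covers the whole string
    # Follow back-pointers from the best final class.
    cls = max(best[-1], key=lambda c: best[-1][c][0])
    i, out = len(text), []
    while i > 0:
        _, (j, prev) = best[i][cls]
        out.append((text[j:i], cls))
        i, cls = j, prev
    return out[::-1]


print(decode("李明在北京工作"))
```

Because segment boundaries and class labels are chosen in the same search, word segmentation and NE identification are resolved jointly rather than in separate passes, which is the sense in which the framework is unified.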
