An Algorithm that Learns What's in a Name

In this paper, we present IdentiFinder™, a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities. We have evaluated the model on English text (using data from the Sixth and Seventh Message Understanding Conferences [MUC-6, MUC-7] and broadcast news), on Spanish text (using data distributed through the First Multilingual Entity Task [MET-1]), and on speech input (using broadcast news). To quantify performance on data available to the community, we report results here only on the standard materials, namely MUC-6 and MET-1. Results have been consistently better than those reported for any other learning algorithm. IdentiFinder's performance is competitive with approaches based on handcrafted rules on mixed-case text and superior on text where case information is not available. We also present a controlled experiment showing the effect of training set size on performance, demonstrating that as little as 100,000 words of training data is adequate to reach performance of around 90% on newswire. Although we present our understanding of why this algorithm performs so well on this class of problems, we believe that significant improvement in performance may still be possible.
