A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities

Although the literature reports very high accuracy figures for the recognition of named entities in text, some named entity phenomena remain problematic for existing text processing systems. One of these is the ambiguity of conjunctions in candidate named entity strings, an all-too-prevalent problem in corporate and legal documents. In this paper, we distinguish four uses of the conjunction in these strings, and explore a supervised machine learning approach to conjunction disambiguation. The approach is trained on a very limited set of ‘name-internal’ features, which avoids the need for expensive lexical or semantic resources. We correctly classify 84% of examples using k-fold evaluation on a data set of 600 instances. We argue that further improvements are likely to require wider domain knowledge and name-external features.
