Information Extraction: Robust Mention Detection Systems

Information-extraction (IE) research typically focuses on clean-text inputs. However, as we will see in this chapter, an IE engine serving real-world applications yields a high rate of false alarms when its input is noisy and less well formed. For example, an application processing output from a multilingual media-monitoring system (e.g., TV broadcast) must deal with noisy input as well as inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. This chapter addresses an important IE task: improving robustness to noise for mention detection (MD). We describe the augmentation of an existing statistical MD system to reduce false alarms in spurious passages while maintaining performance on clean input. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. Detection-error-trade-off (DET) analysis is used to evaluate the MD system. In one experiment, with English as the target language, we find that on inputs from other Latin-alphabet languages we can eliminate 97–98% of false alarms compared to an English-only baseline system, at various fixed miss rates. In another experiment, modeling situations in which genre-specific training is infeasible, we process real data drawn from financial-transactions text containing mixed languages and data-set codes. Because annotations for data sets like this are typically not available for training a mention detector, we did not include any portion of this data set in training the system, yet we still eliminate 60% of the false alarms at various miss rates, compared to the baseline system. These gains come with virtually no loss in accuracy on clean English text.
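
To make the evaluation measure concrete, the following is a minimal sketch of how a reduction in false alarms at a fixed miss rate could be computed from scored mention candidates. It is an illustration only, not the chapter's evaluation code: the function names, the toy scores, and the assumption that each candidate carries a single confidence score with no ties are all hypothetical.

```python
from typing import List, Tuple


def det_points(scores: List[float], labels: List[int]) -> List[Tuple[float, int]]:
    """Sweep a decision threshold from strict to lenient.

    Returns one (miss_rate, false_alarm_count) pair per candidate accepted.
    labels[i] is 1 for a true mention and 0 for a spurious one.
    Assumes at least one true mention and no tied scores (illustrative only).
    """
    n_true = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    hits = 0
    false_alarms = 0
    for _, is_mention in ranked:
        if is_mention:
            hits += 1
        else:
            false_alarms += 1
        points.append((1 - hits / n_true, false_alarms))
    return points


def false_alarms_at_miss_rate(points: List[Tuple[float, int]], target: float) -> int:
    """Fewest false alarms among operating points whose miss rate is <= target."""
    return min(fa for miss, fa in points if miss <= target)


def relative_reduction(baseline_fa: int, improved_fa: int) -> float:
    """Fraction of the baseline's false alarms that the improved system removes."""
    if baseline_fa == 0:
        raise ValueError("baseline has no false alarms to reduce")
    return (baseline_fa - improved_fa) / baseline_fa


# Toy data (hypothetical): the same eight candidates scored by two systems.
labels = [1, 0, 1, 0, 0, 1, 0, 1]
baseline_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
robust_scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.05, 0.5]

baseline = det_points(baseline_scores, labels)
robust = det_points(robust_scores, labels)

for miss in (0.25, 0.0):
    b = false_alarms_at_miss_rate(baseline, miss)
    r = false_alarms_at_miss_rate(robust, miss)
    print(f"miss rate <= {miss}: baseline FA={b}, robust FA={r}, "
          f"reduction={relative_reduction(b, r):.0%}")
```

Comparing the two systems at the same miss rate, rather than at each system's own default threshold, keeps the comparison from being confounded by a system that simply proposes fewer mentions overall.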
