Recall-Oriented Learning of Named Entities in Arabic Wikipedia

We consider the problem of named entity recognition (NER) in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner (a loss function encouraging it to "arrogantly" favor recall over precision) substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.
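
The recall-oriented loss mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the label name `O`, the weight `beta`, and both function names are assumptions made for illustration, and label transitions are left out so that decoding reduces to a per-token argmax. The idea shown is cost-augmented decoding, where a missed entity token (gold is an entity, prediction is `O`) is charged an extra cost during training, so the learner is pushed to score entity labels above `O` by a margin.

```python
from typing import Dict, List

OUTSIDE = "O"  # assumed label for tokens outside any entity


def recall_oriented_cost(gold: List[str], pred: List[str], beta: float = 2.0) -> float:
    """Charge beta for every gold entity token predicted as OUTSIDE.

    False positives are deliberately not charged; that asymmetry is what
    biases the learner toward recall. (Hypothetical cost, for illustration.)
    """
    return sum(beta for g, p in zip(gold, pred) if g != OUTSIDE and p == OUTSIDE)


def cost_augmented_decode(
    scores: List[Dict[str, float]], gold: List[str], beta: float = 2.0
) -> List[str]:
    """Per-token argmax of (model score + cost against the gold label).

    Transition scores are omitted for brevity, so no Viterbi search is needed.
    """
    decoded = []
    for tok_scores, g in zip(scores, gold):
        def augmented(label: str) -> float:
            penalty = beta if (g != OUTSIDE and label == OUTSIDE) else 0.0
            return tok_scores[label] + penalty

        # sorted() makes tie-breaking deterministic
        decoded.append(max(sorted(tok_scores), key=augmented))
    return decoded


if __name__ == "__main__":
    scores = [{"O": 0.4, "B-PER": 0.5}, {"O": 1.0, "B-PER": 0.2}]
    gold = ["B-PER", "O"]
    # With beta = 0 the decoder agrees with the gold labels; with beta = 2 it
    # proposes a missed entity, which would trigger a perceptron-style update.
    print(cost_augmented_decode(scores, gold, beta=0.0))
    print(cost_augmented_decode(scores, gold, beta=2.0))
```

In an online learner of this kind, the weights would be updated whenever the cost-augmented output differs from the gold sequence, so a larger `beta` demands a wider margin in favor of entity labels.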
