Neural Modeling for Named Entities and Morphology (NEMO^2)

Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, because the boundaries of named entities do not coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs, we therefore need to answer two fundamental modeling questions: (i) What should be the basic units to be identified and labeled: tokens or morphemes? and (ii) How can morphological units be encoded and accurately obtained in realistic (non-gold) scenarios? We empirically investigate these questions on a novel NER benchmark that we deliver, with parallel token-level and morpheme-level NER annotations for Modern Hebrew, a morphologically complex language. Our results show that explicitly modeling morphological boundaries consistently improves NER performance, and that a novel hybrid architecture that we propose, in which NER precedes and prunes the morphological decomposition (MD) space, greatly outperforms the standard pipeline approach on both Hebrew NER and Hebrew MD in realistic scenarios.
