Named Entity Recognition on Arabic-English Code-Mixed Data

As a result of globalization and better quality of education, a significant percentage of the population in Arab countries has become bilingual or multilingual. This has raised the frequency of code-switching and code-mixing among Arabs in daily communication. Consequently, a huge amount of Code-Mixed (CM) content can be found on different social media platforms. Such data could be analyzed and used in different Natural Language Processing (NLP) tasks to tackle the challenges emerging from this multilingual phenomenon. Named Entity Recognition (NER), the process of identifying named entities in text, is one of the major tasks in several NLP systems. However, there is a lack of annotated CM data and resources for this task. This work aims at collecting and building the first annotated CM Arabic-English corpus for NER. Furthermore, we constructed a baseline NER system for Arabic-English CM text using deep neural networks and word embeddings, and enhanced it using a pooling technique.
