Improving Arabic Diacritization with Regularized Decoding and Adversarial Training

Arabic diacritization is a fundamental task for Arabic language processing. Previous studies have demonstrated that automatically generated knowledge can help with this task. However, these studies treat the auto-generated knowledge instances as gold references, which limits their effectiveness: such knowledge is not always accurate, and inferior instances can lead to incorrect predictions. In this paper, we propose to use regularized decoding and adversarial training to learn appropriately from such noisy knowledge for diacritization. Experimental results on two benchmark datasets show that, even with quite flawed auto-generated knowledge, our model can still learn adequate diacritics and outperform all previous studies on both datasets.
