Boosting Neural POS Tagger for Farsi Using Morphological Information

Farsi (Persian) is a low-resource language that suffers from data sparsity and a lack of efficient processing tools. Part-of-speech (POS) taggers are among the most important of these tools because they are used broadly across natural language processing tasks. Despite recent work on Farsi tagging, there is still room for improvement: the best reported accuracy so far is 96%, which in special cases can rise to 96.9%. The main weakness of existing taggers is their poor handling of out-of-vocabulary (OOV) words. To address both accuracy and OOV words, we developed a neural network-based POS tagger (NPT) that performs efficiently on Farsi. Despite using less data, NPT outperforms state-of-the-art systems, reaching an accuracy of 97.4%. Its performance is highly influenced by morphological features: we carry out a shallow morphological analysis and show considerable improvement over the baseline configuration.
