Informal-to-Formal Word Conversion for Persian Language Using Natural Language Processing Techniques

A vast amount of text data is available on the Internet today because of the extensive use of social media. Valuable information can be extracted from this data through natural language processing, but extraction is difficult because much of the text is written informally. This paper addresses that challenge by converting Persian informal words to formal words using a spell-checking approach. For this purpose, two datasets, one of formal words and one of informal words, were extracted from the four most visited Persian news websites. Persian informal words were then divided into multiple categories based on the level of change required to build their formal equivalents, and each word was converted to its formal form according to its features. Statistical analyses combined with correction rules were used to produce a “candidate list” from which the best formal equivalent is selected. The conversion system was evaluated on user comments extracted from the four most visited Persian (Farsi) news agencies. Results show that the proposed system can detect approximately 94% of the Persian informal words and can find 85% of the best formal equivalents. In addition, a comparison with two well-known Persian spell-checkers, Virastyar and Vafa, shows that the proposed system significantly outperforms both in detection and correction. Further analysis shows that the time complexity of the proposed system is linear.
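
The candidate-list idea described above can be illustrated with a minimal Python sketch. This is not the authors' system: the tiny lexicon, its frequency counts, the single rewrite rule (colloquial "ون" to formal "ان"), and the function names `edit_distance` and `candidates` are all assumptions made for illustration. The sketch generates candidates from a rewrite rule and from edit distance, then ranks them by frequency.

```python
from collections import Counter

# Hypothetical toy lexicon of formal words; the counts are made up.
FORMAL_FREQ = Counter({"نان": 120, "خانه": 300, "تهران": 250, "خون": 80})

# Illustrative correction rule (informal substring -> formal substring).
RULES = [("ون", "ان")]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def candidates(word, max_dist=1):
    """Return formal candidates for `word`, most frequent first."""
    if word in FORMAL_FREQ:
        # Already a known formal word: nothing to convert.
        return [word]
    found = set()
    # Rule-based candidates: apply each rewrite rule and keep lexicon hits.
    for old, new in RULES:
        if old in word:
            rewritten = word.replace(old, new)
            if rewritten in FORMAL_FREQ:
                found.add(rewritten)
    # Spell-checker-style candidates: lexicon words within max_dist edits.
    for formal in FORMAL_FREQ:
        if edit_distance(word, formal) <= max_dist:
            found.add(formal)
    # Statistical ranking: higher corpus frequency comes first.
    return sorted(found, key=lambda w: -FORMAL_FREQ[w])


if __name__ == "__main__":
    for informal in ["خونه", "نون"]:
        print(informal, candidates(informal))
```

In this toy setup the informal word "خونه" yields a candidate list whose top entry is "خانه". A real system would need a far larger lexicon, many more rules, and context-aware ranking, since a word such as "خون" is itself a valid formal word.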
