Spotting Urdu Stop Words By Zipf's Statistical Approach

This paper presents an innovative method for extracting stop words from large Urdu text. Stop words are words that carry little meaning in natural language; they slow down language processing and degrade language analysis. They are therefore typically removed first to speed up data processing, but no reliable stop word removal method exists for Urdu. In this work, we apply Zipf's law of two-factor dependency, together with the least-effort approach, to spot stop words in an Urdu corpus created specifically for this research. All Urdu text processing and analysis is carried out in Python 3.4. Previous work on stop word removal is also investigated and found to be less helpful. Using the Zipfian approach, 358 of the 500 highest-frequency words are identified as stop words. We observe that by focusing on only 0.01% of a large corpus, almost all stop words can be spotted, so a stop word list can be created with minimal manual effort. Furthermore, statistical patterns in stop words and content words, the ratio of stop words to content words in data samples, and the dependency of stop words and content words on data size are also examined. In terms of data size, frequency, and rank, Zipf's law and Heaps' law coexist in Urdu stop words. Stop words tend to follow predictable and measurable patterns that can lead to reliable probabilistic methods for Urdu processing. This deterministic approach provides a strong research ground for exploring stop words in Urdu text statistically.
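The rank-frequency idea behind this approach can be illustrated with a minimal Python sketch. It is not the paper's implementation: the whitespace tokenization, the toy sentence, and the helper names `rank_frequency`, `zipf_exponent`, and `stop_word_candidates` are assumptions made for illustration only. The sketch ranks words by frequency, estimates the Zipf exponent as the negated least-squares slope of log(frequency) against log(rank), and returns the highest-ranked words as stop word candidates that would still need review.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (word, frequency) pairs sorted by descending frequency.

    Ties are broken alphabetically so the ranking is reproducible.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def zipf_exponent(ranked):
    """Estimate the Zipf exponent from a ranked (word, frequency) list.

    Fits log(frequency) = c - s * log(rank) by ordinary least squares
    and returns s. Requires at least two ranked entries.
    """
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for _, freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def stop_word_candidates(tokens, top_k=500):
    """Return the top_k most frequent words as stop word candidates."""
    return [word for word, _ in rank_frequency(tokens)[:top_k]]


# Toy Urdu sentence (assumption: whitespace splitting is adequate here).
text = "یہ کتاب ہے اور یہ قلم ہے اور وہ کتاب ہے"
tokens = text.split()

ranked = rank_frequency(tokens)
candidates = stop_word_candidates(tokens, top_k=3)
exponent = zipf_exponent(ranked)
```

On a real corpus, `top_k=500` would mirror the 500 high-frequency words examined in this work. Frequency alone does not separate stop words from frequent content words, so the candidate list is a starting point for inspection rather than a final stop word list.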
