Complex Network of Urdu Language

This work proposes a state-of-the-art technique for examining the composition patterns and topological structure of the Urdu language. The proposed method explores Urdu text as a word co-occurrence network within the framework of complex network theory. For the first time, Urdu text is successfully transformed into a graph, despite the difficulties posed by the Nastaliq script, the scarcity of resources, and the limited support offered by language-processing tools. We have constructed an open, unannotated corpus of more than 3 million words using a random forest approach. An undirected, unweighted graph of the Urdu co-occurrence network is created in Python 3.4. The resulting network, built with a bag-of-bigrams model, consists of 5,180 nodes and 101,415 edges. A detailed statistical analysis of the graph is performed in the graph visualization tool Gephi 0.9.2. Furthermore, an Erdős–Rényi random graph of similar size is generated as a null model for comparison with the Urdu network. The comparison is based on the average path length, clustering coefficient, and hierarchy of both networks. Analysis of these key features shows that the Urdu network differs from the random network. Its small average path length and high clustering coefficient confirm the small-world effect in the Urdu language. Additionally, 11 communities are detected in the Urdu network, whereas the random network contains only one. These statistics indicate that the Urdu network is a scale-free network with a layered composition pattern. Together, the small-world effect and scale-free behavior establish Urdu as a complex network with a paradigmatic hierarchy in terms of authority distribution among words.
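The short Python sketch below illustrates the kind of analysis described above. It is a minimal sketch, not the authors' implementation: the corpus file name urdu_corpus.txt is hypothetical, whitespace tokenization stands in for a proper Urdu tokenizer, the statistics are computed with networkx rather than Gephi, and greedy modularity maximization is assumed for community detection because the abstract does not name the method used.

import networkx as nx

# Hypothetical corpus file; whitespace splitting stands in for real Urdu tokenization.
with open("urdu_corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()

# Undirected, unweighted co-occurrence graph built from adjacent word pairs (bigrams).
G = nx.Graph()
G.add_edges_from(zip(tokens, tokens[1:]))
G.remove_edges_from(list(nx.selfloop_edges(G)))

n, m = G.number_of_nodes(), G.number_of_edges()
print("nodes:", n, "edges:", m)

# Small-world statistics, computed on the largest connected component.
giant = G.subgraph(max(nx.connected_components(G), key=len))
print("average path length:", nx.average_shortest_path_length(giant))
print("clustering coefficient:", nx.average_clustering(G))

# Erdos-Renyi null model with the same number of nodes and edges.
R = nx.gnm_random_graph(n, m, seed=42)
R_giant = R.subgraph(max(nx.connected_components(R), key=len))
print("random average path length:", nx.average_shortest_path_length(R_giant))
print("random clustering coefficient:", nx.average_clustering(R))

# Community detection (greedy modularity maximization, assumed here).
communities = nx.algorithms.community.greedy_modularity_communities(G)
print("communities detected:", len(communities))

On a co-occurrence network of the kind described above, one would expect the average path length to be close to that of the random graph while the clustering coefficient is substantially higher; this combination is the small-world signature reported for the Urdu network.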
