A comparative study on clustering techniques for Urdu ligatures in Nastaliq font

Clustering is a pivotal step in any Optical Character Recognition (OCR) or word spotting system. It serves as the basis for classifying and indexing words or characters, depending on the level of segmentation. Researchers have applied various clustering methodologies to document images in Latin script. However, for Urdu, which belongs to the Arabic and Persian family of scripts, clustering-based indexing systems have not been extensively researched. In this paper, we present a comprehensive study of well-known clustering techniques applied to printed Urdu document images. The images are segmented into ligatures (partial words), which are then grouped using different clustering methods. The performance of these methods is evaluated using the Calinski-Harabasz, Davies-Bouldin, and Dunn indices.
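The three validity indices named above can be illustrated with a minimal sketch. It assumes each ligature has already been reduced to a fixed-length numeric feature vector and that a clustering method has assigned one label per vector; the feature extraction, the function names, and the toy data are illustrative assumptions, not the paper's implementation. Higher Calinski-Harabasz and Dunn values, and lower Davies-Bouldin values, indicate more compact and better separated clusters.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def group_by_label(points, labels):
    """Split points into one list per cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return list(clusters.values())

def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion (needs 2 <= k < n)."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * math.dist(centroid(c), overall) ** 2 for c in clusters)
    within = sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(points, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    clusters = group_by_label(points, labels)
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

def dunn(points, labels):
    """Smallest inter-cluster distance over largest cluster diameter.

    Assumes at least one cluster holds two distinct points, so the
    largest diameter is non-zero.
    """
    clusters = group_by_label(points, labels)
    min_sep = min(math.dist(p, q)
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:]
                  for p in a for q in b)
    max_diam = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return min_sep / max_diam

# Toy 2-D "feature vectors" (hypothetical): two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # matches the two groups
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]    # mixes the two groups

for name, index in [("Calinski-Harabasz", calinski_harabasz),
                    ("Davies-Bouldin", davies_bouldin),
                    ("Dunn", dunn)]:
    print(name, index(points, good_labels), index(points, bad_labels))
```

On this toy data, all three indices prefer the labeling that matches the two groups. This version of the Dunn index uses single-linkage separation and complete diameter, which is one common variant among several.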
