Feature Extraction for Cursive Language Document Images: Using Discrete Cosine Transform, Discrete Wavelet Transform and Gabor Filter

The efficiency of any machine learning or computer vision system depends largely on the robustness of its feature extraction and selection process. For word spotting applications, many suitable features have been proposed in the literature over the years. Most of these features were designed for Latin text but are also used with Oriental scripts. Features that are more specific to Oriental text are being actively investigated, and Deep Learning has also been employed for this purpose. In this paper, we investigate the performance of shape-based features for Urdu script. Urdu and Arabic belong to the same family of scripts and share a similar alphabet, so features that work well on Urdu are expected to perform similarly on Arabic and on other Oriental scripts. We report results on approximately 21000 ligatures belonging to 200 unique classes, taken from scanned pages of the popular Urdu series 'Zaawiyya'. This work is part of a project funded by the Higher Education Commission, which also provided the data set. The proposed system gives encouraging results, with a precision of 88.5% and a recall of 90.8%.
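For readers unfamiliar with the two reported measures, the following is a minimal sketch of how precision and recall are commonly computed for a single word spotting query. The function name `precision_recall` and the ligature ids are hypothetical and chosen only for illustration; they are not taken from the paper or its data set, and the paper may aggregate the measures over queries differently.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the ligatures returned by the spotting system
    relevant:  ids of the ligatures that truly belong to the query class
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of returned ligatures that are correct.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of correct ligatures that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy ids invented for illustration (assumption, not the paper's data).
retrieved_ids = [1, 2, 3, 4, 5]
relevant_ids = [2, 3, 4, 5, 6, 7, 8, 9]
p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy query, four of the five returned ligatures are correct, and four of the eight relevant ligatures are found.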
