Script Identification of Multi-Script Documents: A Survey

In recent years, with the widespread use of the Internet and the digitized processing of multi-script documents worldwide, script identification techniques have become increasingly important in the pattern recognition field. Script identification concerns methods for identifying the different scripts present in multi-lingual, multi-script documents. This paper presents a comprehensive overview of research activities in the field and focuses on the most valuable results obtained so far. The key processes in script identification are addressed in detail: identification and discrimination methods, feature extraction (local and global), and classification. Different kinds of approaches have been developed and promising results have been achieved. The paper reports state-of-the-art performance results and covers methods for handwritten, printed, and hybrid document processing. More research is necessary to reach the performance levels essential for everyday applications.
