Estimating the Effects of Text Genre, Image Resolution and Algorithmic Complexity needed for Sinhala Optical Character Recognition

While optical character recognition (OCR) for Latin-based scripts has reached near-human performance, accuracy for the rounded scripts of South Asia still lags behind. Work on Sinhala OCR has mainly reported performance on constrained classes of font faces and has therefore been inconclusive. This paper presents a comprehensive series of experiments using conventional machine learning as well as deep learning on texts and font faces of diverse types and resolutions, in order to give a realistic estimate of the complexity of recognizing the rounded script of Sinhala. While text from both old and contemporary books can be recognized with over 87% accuracy, text from old newspapers is much harder to recognize owing to poor print quality and resolution.
