A Novel OCR System Based on Rough Set Semi-reduct

Most well-known OCR engines, such as Google Tesseract, rely on supervised classification, which slows the system down as the diversity of font styles increases. To avoid the tedium and pitfalls of training an OCR system without compromising its efficiency, we introduce a novel rough-set-theoretic model. It is designed to perform unsupervised classification of optical characters using a suboptimal attribute set, called the semi-reduct. The semi-reduct attributes are mostly geometric and topological in nature, each taking a small range of discrete values estimated from different combinatorial characteristics of rough-set approximations. This makes almost all characters quickly and easily discernible, irrespective of their font style. For the few characters that remain indiscernible, Tesseract features are used, but very sparingly and only in the final stages of the OCR pipeline, so that the overall run time stays low. Preliminary experimental results demonstrate the promise of the approach and its scope for further development.
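
The notion of discernibility under a set of discrete attributes can be illustrated with a minimal sketch. This is not the system described here: the attribute names (holes, end points, connected components), the glyph table, and the function names are all hypothetical and chosen only for illustration. The sketch groups glyphs into indiscernibility classes, the standard rough-set partition induced by equal attribute values, and reports the classes that a chosen attribute subset fails to separate.

```python
from collections import defaultdict

# Hypothetical attribute table (illustrative values, not taken from the paper):
# each glyph maps to (holes, end_points, connected_components).
GLYPHS = {
    "O": (1, 0, 1),
    "D": (1, 0, 1),
    "B": (2, 0, 1),
    "C": (0, 2, 1),
    "i": (0, 2, 2),
}


def indiscernibility_classes(table, attr_idx):
    """Partition glyphs by their values on the selected attribute indices."""
    classes = defaultdict(list)
    for glyph, values in table.items():
        key = tuple(values[i] for i in attr_idx)
        classes[key].append(glyph)
    return [sorted(group) for group in classes.values()]


def unresolved(table, attr_idx):
    """Return the classes with more than one glyph, i.e. glyphs that the
    selected attributes cannot tell apart."""
    groups = indiscernibility_classes(table, attr_idx)
    return sorted(g for g in groups if len(g) > 1)


if __name__ == "__main__":
    # Using all three hypothetical attributes.
    print(unresolved(GLYPHS, (0, 1, 2)))
    # Using only the number of holes.
    print(unresolved(GLYPHS, (0,)))
```

In this toy table, "O" and "D" stay in the same class even when all three attributes are used, so they play the role of the few indiscernible characters that would be passed on to a fallback stage; using the hole count alone additionally leaves "C" and "i" unseparated.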