Recognition of printed Arabic text using machine learning

Many papers have been concerned with the recognition of Latin, Chinese and Japanese characters. However, although almost a third of a billion people worldwide, in several different languages, use Arabic characters for writing, little research progress, in both on-line and off-line has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text database, dictionaries, etc. and of course of the cursive nature of its writing rules. The main theme of this paper is the automatic recognition of Arabic printed text using machine learning C4.5. Symbolic machine learning algorithms are designed to accept example descriptions in the form of feature vectors which include a label that identifies the class to which an example belongs. The output of the algorithm is a set of rules that classifies unseen examples based on generalization from the training set. This ability to generalize is the main attraction of machine learning for handwriting recognition. Samples of a character can be preprocessed into a feature vector representation for presentation to a machine learning algorithm that creates rules for recognizing characters of the same class. Symbolic machine learning has several advantages over other learning methods. It is fast in training and in recognition, generalizes well, is noise tolerant and the symbolic representation is easy to understand. The technique can be divided into three major steps: the first step is pre- processing in which the original image is transformed into a binary image utilizing a 300 dpi scanner and then forming the connected component. Second, global features of the input Arabic word are then extracted such as number subwords, number of peaks within the subword, number and position of the complementary character, etc. Finally, machine learning C4.5 is used for character classification to generate a decision tree.

[1]  L. D. Harmon,et al.  Automatic recognition of print and script , 1972 .

[2]  Hussein Almuallim,et al.  A Method of Recognition of Arabic Cursive Handwriting , 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Adnan Amin,et al.  A New Structural Technique for Recognizing Printed Arabic Text , 1995, Int. J. Pattern Recognit. Artif. Intell..

[4]  Sabri A. Mahmoud,et al.  Survey and bibliography of Arabic optical text recognition , 1995, Signal Process..

[5]  Adnan Amin,et al.  Off-line Arabic character recognition: the state of the art , 1998, Pattern Recognit..

[6]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[7]  D. Guillevic,et al.  Cursive script recognition: A fast reader scheme , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[8]  Sabah S. Al-Fedaghi,et al.  Machine Recognition of Printed Arabic Text Utilizing Natural Language Morphology , 1991, Int. J. Man Mach. Stud..

[9]  Christopher Raphael,et al.  Language-independent OCR using a continuous speech recognition system , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[10]  Adnan Amin,et al.  Hand-printed arabic character recognition system using an artificial network , 1996, Pattern Recognit..

[11]  Adnan Amin,et al.  ARABIC CHARACTER RECOGNITION , 1997 .

[12]  Adnan Amin,et al.  Page segmentation and classification utilising a bottom-up approach , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.