Extraction of numerical strings in Farsi/Arabic documents using structural features

In this paper, we present an approach for separating digits from non-digits in order to extract numerical strings from Farsi/Arabic handwritten or machine-printed document images. Each connected component is labeled according to whether or not it belongs to a numerical string. For this purpose, we introduce a set of features that, first, are based on the maximum difference between digits and non-digits in Farsi and, second, have much lower complexity and extraction time than the features used for connected component recognition. A fuzzy rule-based classifier is used to classify the features. Experimental results show an acceptable detection rate with a low false positive rate.
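The pipeline summarized above (connected components, cheap structural features, fuzzy rules) can be illustrated with a minimal sketch. The abstract does not list the actual features or rules, so everything specific here is an assumption made for illustration: the two features (bounding-box aspect ratio and fill density), the membership function breakpoints, the two rules, and the function names `connected_components`, `component_features`, and `is_digit_candidate` are all hypothetical and are not the paper's method.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels (value 1) of a binary image into
    8-connected components, returned in row-major discovery order."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def component_features(pixels):
    """Hypothetical structural features: bounding-box aspect ratio
    (width / height) and fill density (pixels / bounding-box area)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {"aspect": width / height,
            "density": len(pixels) / (width * height)}


def triangular(x, left, peak, right):
    """Triangular fuzzy membership function with its maximum at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def ramp_up(x, low, high):
    """Membership that rises linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def is_digit_candidate(feat):
    """Evaluate two illustrative fuzzy rules (breakpoints are assumed):
    R1: IF aspect is narrow AND density is high THEN digit
    R2: IF aspect is wide THEN non-digit
    AND is taken as min; the stronger conclusion wins."""
    narrow = triangular(feat["aspect"], 0.0, 0.5, 1.5)
    dense = ramp_up(feat["density"], 0.3, 0.7)
    wide = ramp_up(feat["aspect"], 1.0, 2.0)
    digit_strength = min(narrow, dense)
    non_digit_strength = wide
    return digit_strength > non_digit_strength
```

A document image would be binarized first, after which each component returned by `connected_components` is passed through `component_features` and `is_digit_candidate`. In a real system the membership functions and rules would be derived from labeled Farsi/Arabic samples rather than set by hand as they are here.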