Exploring More Representative States of Hidden Markov Model in Optical Character Recognition: A Clustering-Based Model Pre-Training Approach

The Hidden Markov Model (HMM) is an effective method for describing sequential signals in many applications. For model estimation, common training algorithms focus only on the optimization of model parameters. However, model structure influences system performance as well. Although several structure optimization methods have been proposed, they are usually implemented as an independent module before parameter optimization. In this paper, the clustering property of HMM states is discussed by comparing the mechanisms of the Quadratic Discriminant Function (QDF) classifier and the HMM. Then, exploiting the clustering effect of Viterbi training and Baum–Welch training, a novel clustering-based model pre-training approach is proposed. It optimizes model parameters and model structure in turn, until the representative states of all models have been explored. Finally, the proposed approach is evaluated on two typical OCR applications, printed and handwritten Arabic text line recognition, and compared with several other optimization methods. The improvement in character recognition performance shows that the proposed approach achieves more precise state allocation, and that the resulting representative states benefit HMM decoding.
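To make the alternation between parameter and structure optimization concrete, the sketch below illustrates one possible reading of the idea for a single character model: Baum–Welch re-estimation is interleaved with a clustering test on the Viterbi-aligned frames of each state, and a state is split when its frames clearly form more than one cluster. The abstract does not specify the authors' toolkit, splitting criterion, or thresholds, so this is only a minimal sketch built on hmmlearn and scikit-learn; the function names `structure_score` and `pretrain`, and the `split_gain` parameter, are illustrative assumptions, not the paper's method.

```python
# Minimal sketch, assuming left-to-right Gaussian HMMs via hmmlearn.
# Not the authors' implementation; criteria and names are illustrative.
import numpy as np
from hmmlearn import hmm
from sklearn.cluster import KMeans

def structure_score(frames, k):
    """k-means distortion of the frames aligned to one state (assumed criterion)."""
    if len(frames) <= k:
        return np.inf
    return KMeans(n_clusters=k, n_init=5, random_state=0).fit(frames).inertia_

def pretrain(X, lengths, n_states=4, max_rounds=5, split_gain=0.5):
    """Alternate Baum-Welch re-estimation with a clustering-based state-split
    test until the number of states stops changing."""
    for _ in range(max_rounds):
        # (1) Parameter optimization: Baum-Welch (EM) on the current structure.
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)

        # (2) Structure optimization: Viterbi-align frames to states and test
        # whether any state actually covers two clusters of observations.
        align = model.predict(X, lengths)
        n_new = n_states
        for s in range(n_states):
            frames = X[align == s]
            one = structure_score(frames, 1)
            two = structure_score(frames, 2)
            if np.isfinite(two) and two < split_gain * one:
                n_new += 1          # state not representative enough: split it
        if n_new == n_states:       # structure stable: representative states found
            return model
        n_states = n_new
    return model
```

In such a scheme, the inner `fit` call plays the role of parameter optimization and the split test plays the role of structure optimization, so the two are no longer decoupled into an independent pre-processing step and a training step.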