Regularizing CTC in Expectation-Maximization Framework with Application to Handwritten Text Recognition

Connectionist Temporal Classification (CTC) is an objective function for sequence learning that has shown promising results in speech and text recognition tasks. However, its underlying mechanism has not been thoroughly investigated. In this paper, we propose a theoretical explanation of CTC from the perspective of the Expectation-Maximization (EM) algorithm. Based on this EM analysis, we propose a pseudo-label-based L1 regularization and a voting decoding algorithm to improve text recognition performance. The L1 regularization reduces the pseudo-label estimation error, while the voting decoding algorithm modifies the built-in decoding logic of CTC and introduces a voting mechanism into the inference process. Experiments on handwritten text recognition show that the proposed method consistently improves over the CTC baseline and yields state-of-the-art results on three benchmark datasets.
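The abstract does not give the exact formulation of the regularized objective. As a rough sketch only: assuming the pseudo-labels are the one-hot greedy best-path predictions per frame (an assumption, not stated above) and the L1 term penalizes their distance to the per-frame posteriors, a minimal NumPy version of the standard CTC forward algorithm plus such a penalty could look like the following. The names `ctc_neg_log_likelihood`, `regularized_ctc_loss`, and the weight `lam` are hypothetical.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Standard CTC negative log-likelihood via the alpha (forward) recursion.

    log_probs: (T, V) frame-wise log posteriors; target: label indices, no blanks.
    """
    T, V = log_probs.shape
    # Extended label sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for l in target:
        ext += [l, blank]
    S = len(ext)
    NEG_INF = -np.inf

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + np.log(sum(np.exp(x - m) for x in xs))

    alpha = np.full((T, S), NEG_INF)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                     # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])         # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])         # skip a blank
            alpha[t, s] = logsumexp(*cands) + log_probs[t, ext[s]]
    # Paths may end on the last label or the trailing blank.
    ll = logsumexp(alpha[T - 1, S - 1],
                   alpha[T - 1, S - 2] if S > 1 else NEG_INF)
    return -ll

def regularized_ctc_loss(log_probs, target, lam=0.1, blank=0):
    """CTC loss plus an L1 penalty toward one-hot greedy pseudo-labels (assumed form)."""
    probs = np.exp(log_probs)
    greedy = probs.argmax(axis=1)                 # best-path pseudo-labels per frame
    onehot = np.eye(probs.shape[1])[greedy]
    l1 = np.abs(probs - onehot).sum()             # distance of posteriors to pseudo-labels
    return ctc_neg_log_likelihood(log_probs, target, blank) + lam * l1
```

The penalty is minimized when the posteriors are already peaked at the greedy prediction, which matches the abstract's stated goal of reducing pseudo-label estimation error; the exact penalty in the paper may differ.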