Improved Typesetting Models for Historical OCR

We present richer typesetting models that extend the unsupervised historical document recognition system of BergKirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and nonitalic portions of documents. Richer models complicate inference so we present a new, streamlined procedure that is over 25x faster than the method used by BergKirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers.