HIT-MW Dataset for Offline Chinese Handwritten Text Recognition

A Chinese handwritten text dataset, HIT-MW, is presented to facilitate offline Chinese handwritten text recognition. The texts to be hand-copied are drawn from the China Daily corpus by stratified random sampling. To collect naturally written handwriting, forms are distributed by postal mail or through intermediaries rather than face to face. The current version of HIT-MW includes 853 forms and 186,444 characters, written by more than 780 participants under unconstrained conditions, without preprinted character boxes. Its lexicon of 3,041 characters covers about 99.33% of the China Daily corpus, which contains about 80 million characters. The handwritten texts, written mainly by college students, are balanced with respect to both the writers' sex and their department. HIT-MW can be used for Chinese text line segmentation and segmentation-free recognition, and to verify the effect of statistical language models on real handwriting.
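
As an illustration of the coverage figure above, the following is a minimal sketch of one common way to compute lexical coverage: the share of character tokens in a reference corpus whose character type also occurs in the dataset. The function name `lexical_coverage`, the whitespace filtering, and the toy strings are assumptions made for this example; the abstract does not specify the exact procedure used for HIT-MW.

```python
from collections import Counter


def lexical_coverage(dataset_text: str, corpus_text: str) -> float:
    """Fraction of corpus character tokens whose type occurs in the dataset.

    Hypothetical helper for illustration only, not the authors' tool.
    Whitespace is ignored in both inputs (an assumption of this sketch).
    """
    # Frequency of every non-whitespace character in the reference corpus.
    corpus_counts = Counter(ch for ch in corpus_text if not ch.isspace())
    total = sum(corpus_counts.values())
    if total == 0:
        raise ValueError("corpus contains no characters")

    # Distinct character types that appear in the dataset (its lexicon).
    dataset_chars = {ch for ch in dataset_text if not ch.isspace()}

    # Corpus tokens covered by the dataset lexicon.
    covered = sum(n for ch, n in corpus_counts.items() if ch in dataset_chars)
    return covered / total


# Toy data, not taken from HIT-MW or the China Daily corpus.
corpus = "我们爱中国我们爱北京"
dataset = "我们爱中国"
coverage = lexical_coverage(dataset, corpus)
print(f"{coverage:.2%}")
```

Measured this way, a lexicon of a few thousand frequent characters can cover nearly all of a large corpus, because character frequencies in running text are highly skewed.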
