Multimodal Crowdsourcing for Transcribing Handwritten Documents

Transcription of handwritten documents is an important research topic for multiple applications, such as document classification or information extraction. In the case of historical documents, transcription helps preserve cultural heritage, given the amount of historical data those documents contain. The transcription process can employ state-of-the-art handwritten text recognition systems to obtain an initial transcription. This transcription usually does not meet the required quality standards, but it may speed up the expert's final transcription. In this framework, the use of collaborative transcription applications (crowdsourcing) has risen in recent years, but these platforms are mostly restricted to non-mobile devices. This restriction reduces the pool of potential volunteers that recruiting initiatives can reach. In this paper, an alternative that allows the use of mobile devices is presented. The proposal consists of using speech dictation of handwritten text lines. Then, by multimodal combination of speech and handwritten text images, a draft transcription can be obtained that is of higher quality than one obtained by handwritten text recognition alone. The speech dictation platform is implemented as a mobile device application, which widens the population from which volunteers can be recruited. A real acquisition of the contents of a Spanish historical handwritten book was carried out with the platform. These data were used to perform experiments on the behaviour of the proposed framework. Some experiments were performed to study how to optimise the collaborators' effort in terms of number of collaborations, including how many lines and which lines should be selected for speech dictation.
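
The two ideas in the abstract, combining the speech and handwriting recognisers for a line and choosing which lines to send for dictation, can be illustrated with a minimal sketch. It is not the method of the paper: it substitutes a simple log-linear combination of N-best hypothesis scores and a lowest-confidence selection rule, and every name in it (`combine_nbest`, `select_lines_for_dictation`, `weight`, `floor`, the example scores) is a hypothetical choice made for illustration.

```python
from typing import Dict, List


def combine_nbest(htr: Dict[str, float], asr: Dict[str, float],
                  weight: float = 0.5, floor: float = -20.0) -> str:
    """Return the hypothesis with the best combined score.

    htr and asr map a candidate transcription of one text line to its
    log-probability under the handwriting and speech recogniser.
    A hypothesis missing from one list gets the (assumed) floor score.
    """
    candidates = set(htr) | set(asr)

    def score(hyp: str) -> float:
        # Log-linear interpolation of the two modalities (an assumption).
        return weight * htr.get(hyp, floor) + (1.0 - weight) * asr.get(hyp, floor)

    return max(candidates, key=score)


def select_lines_for_dictation(confidences: List[float], budget: int) -> List[int]:
    """Pick the `budget` lines whose handwriting-recognition confidence is
    lowest, on the assumption that dictation helps most on those lines."""
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:budget]


# Illustrative scores only; they are not taken from any experiment.
htr_scores = {"en el nombre": -1.0, "en el hombre": -0.8}
asr_scores = {"en el nombre": -0.5, "en el numbre": -0.9}
best = combine_nbest(htr_scores, asr_scores)
to_dictate = select_lines_for_dictation([0.9, 0.4, 0.7, 0.2], budget=2)
```

In this toy case each recogniser alone prefers or admits a wrong hypothesis, while the combined score favours the one both modalities support. The `weight` parameter would need tuning on held-out data.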
