Multimodality, interactivity, and crowdsourcing for document transcription

Knowledge mining from documents usually relies on document engineering techniques that give the user access to the information contained in documents of interest. In this framework, transcription may provide efficient access to the contents of handwritten documents. Manual transcription is a time-consuming task that can be sped up through different mechanisms. A first possibility is to employ state-of-the-art handwritten text recognition systems to obtain an initial draft transcription that can be manually amended. A second option is to employ crowdsourcing to obtain a massive but not error-free draft transcription. In this case, when collaborators use mobile devices, speech dictation can serve as a transcription source, and the outputs of speech and handwritten text recognition can be fused to provide a better draft transcription, which can be amended with even less effort. A final option is to use interactive assistive frameworks, in which the automatic system that provides the draft transcription and the transcriber cooperate to generate the final transcription. The novel contributions presented in this work include the study of data fusion in a multimodal crowdsourcing framework and its integration with an interactive system. The proposed solutions reduce the required transcription effort and optimize overall performance and usability, allowing for a better transcription process.
