We present a system that enables efficient, collaborative human correction of ASR transcripts and is designed to operate in real time, for example when post-editing live captions generated for news broadcasts. In the system, confusion networks derived from ASR lattices are used to highlight low-confidence words and to present alternatives to the user for quick correction. The system uses a client-server architecture in which information about each manual edit is posted to the server. This information can be used to dynamically update the one-best ASR output for all utterances currently in the editing pipeline. We propose to make these updates in three ways: by finding a new one-best path through an existing ASR lattice that is consistent with the correction received; by identifying further instances of out-of-vocabulary terms entered by the user; and by adapting the language model on the fly. Updates are received asynchronously by the client.
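To illustrate the highlighting mechanism described above, the following is a minimal sketch (in Python) of how a confusion network could be scanned for low-confidence words whose alternatives are then offered to the editor. The `ConfusionBin` structure, the 0.8 confidence threshold, and all names here are illustrative assumptions for exposition; they are not the paper's implementation.

```python
# Illustrative sketch only: flag low-confidence words in a confusion
# network and rank their alternatives for one-click correction.
# The data structure, threshold, and names are assumptions, not the
# system described in the paper.

from dataclasses import dataclass, field


@dataclass
class ConfusionBin:
    """One time slot of a confusion network: competing words with
    posterior probabilities summing to (approximately) 1."""
    alternatives: dict[str, float] = field(default_factory=dict)

    def best(self) -> tuple[str, float]:
        word = max(self.alternatives, key=self.alternatives.get)
        return word, self.alternatives[word]


def correction_candidates(network: list[ConfusionBin],
                          threshold: float = 0.8):
    """Yield (position, best word, ranked alternatives) for every bin
    whose top hypothesis falls below the confidence threshold, i.e.
    the words the editing interface would highlight."""
    for i, slot in enumerate(network):
        word, confidence = slot.best()
        if confidence < threshold:
            ranked = sorted(slot.alternatives.items(),
                            key=lambda kv: -kv[1])
            yield i, word, ranked


# Toy confusion network for the utterance "the cat sat".
network = [
    ConfusionBin({"the": 0.95, "a": 0.05}),
    ConfusionBin({"cat": 0.55, "cap": 0.30, "cut": 0.15}),
    ConfusionBin({"sat": 0.90, "sad": 0.10}),
]

for pos, word, ranked in correction_candidates(network):
    print(f"bin {pos}: '{word}' is uncertain; alternatives: {ranked}")
```

In a full system, a confirmed correction for a flagged bin would be posted to the server, which could then constrain lattice rescoring for the remaining utterances in the pipeline, as the abstract outlines.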