Improvements in Japanese Voice Search

This paper describes work on Japanese voice search at Yahoo! Japan. We first describe several implementation details of our internal WFST-based decoder that make the voice search task more efficient, including a simple but effective compressed WFST arc representation. This representation permits a decoder process of approximately 2 GB of memory for a 1-million-word vocabulary and a language model with 35 million N-grams. We then describe our baseline system built on this decoder and compare it against two open-source decoders, Juicer and Julius. We also describe our initial attempts to adapt the baseline system through simple language model adaptation using manually transcribed, anonymized voice queries. To achieve this, we present a sequence of WFST operations that preserves consistency of segmentation between manual and automatic transcriptions. We show that even this simple adaptation method yields a relative reduction of up to 4.6% in sentence error rate and 8.2% in character error rate.
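To give a feel for why a compressed arc representation reduces decoder memory, the following is a minimal sketch in Python. It is not the representation used in the paper, whose details are not given here. It assumes a hypothetical layout in which the input label, output label, and next-state index of each arc are stored as variable-length integers and the weight as a 32-bit float. All function names are illustrative.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte (LSB first)."""
    if n < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_arc(ilabel, olabel, weight, nextstate):
    """Pack one arc (hypothetical layout): three varints, then a float32."""
    return (
        encode_varint(ilabel)
        + encode_varint(olabel)
        + encode_varint(nextstate)
        + struct.pack("<f", weight)
    )


def unpack_arc(buf, pos=0):
    """Unpack one arc at buf[pos]; return ((il, ol, w, ns), next position)."""
    ilabel, pos = decode_varint(buf, pos)
    olabel, pos = decode_varint(buf, pos)
    nextstate, pos = decode_varint(buf, pos)
    (weight,) = struct.unpack_from("<f", buf, pos)
    return (ilabel, olabel, weight, nextstate), pos + 4


# A fixed-width arc with four 32-bit fields would occupy 16 bytes.
packed = pack_arc(ilabel=3, olabel=0, weight=0.5, nextstate=1200)
arc, end = unpack_arc(packed)
```

In this sketch the example arc packs into 8 bytes instead of the 16 needed by four fixed 32-bit fields, because small labels and state indices take only one or two bytes. The trade-off is that arcs are no longer fixed-size, so random access within a state's arc list requires sequential decoding or an auxiliary offset table.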
