Multi-Talker Speech Recognition Based on Blind Source Separation with ad hoc Microphone Array Using Smartphones and Cloud Storage

In this paper, we present a multi-talker speech recognition system based on blind source separation with an ad hoc microphone array consisting of smartphones and cloud storage. In this system, a mixture of voices from multiple speakers is recorded by each speaker’s smartphone, and each recording is automatically transferred to online cloud storage. Our prototype system is realized using iPhones and Dropbox. Although the signals recorded by different iPhones are not synchronized, a blind synchronization technique compensates for both the time offset and the sampling frequency mismatch between them. Then, auxiliary-function-based independent vector analysis separates the synchronized mixture into the individual speakers’ voices. Finally, automatic speech recognition is applied to transcribe the separated speech. An experimental evaluation of the system using the Julius speech recognition engine confirms that it effectively reduces speech overlap and improves speech recognition performance.
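To illustrate the time-offset part of the synchronization step, the following is a minimal sketch that estimates the offset between two recordings by maximizing their cross-correlation over a bounded range of lags. This is a simple stand-in and not necessarily the blind synchronization technique used in the system; it does not address the sampling frequency mismatch. The function names `estimate_offset` and `align` and the parameter `max_lag` are hypothetical and chosen only for this example.

```python
import numpy as np


def estimate_offset(ref, other, max_lag):
    """Estimate by how many samples `other` trails `ref`.

    Illustrative only: picks the lag in [-max_lag, max_lag] that maximizes
    the cross-correlation of the overlapping parts of the two recordings.
    """
    n = min(len(ref), len(other))
    ref, other = ref[:n], other[:n]
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = ref[: n - lag], other[lag:]
        else:
            a, b = ref[-lag:], other[: n + lag]
        score = float(np.dot(a, b))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(ref, other, lag):
    """Trim the two recordings so that they start at the same instant."""
    if lag >= 0:
        other = other[lag:]
    else:
        ref = ref[-lag:]
    n = min(len(ref), len(other))
    return ref[:n], other[:n]


# Synthetic check: `other` is `ref` delayed by 37 samples.
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)
ref = src[50:3050]
other = src[13:3013]
lag = estimate_offset(ref, other, max_lag=100)
ref_aligned, other_aligned = align(ref, other, lag)
```

In practice the recordings from different devices also drift apart over time because of the sampling frequency mismatch, so a single constant lag such as the one estimated here would only be a first, coarse correction.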
