Improved Transcription and Speaker Identification System for Concurrent Speech in Bahasa Indonesia Using Recurrent Neural Network

Bahasa Indonesia is one of the most prominent low-resource languages and still lacks development in communication-assisting technology. This paper proposes an improved system for generating transcripts and identifying speakers from concurrent speech in Bahasa Indonesia. The proposed method is applicable to situations such as online meetings and remote conferences. The system combines a Reinforcement Learning (RL) model with pitch-aware speech separation to identify the speakers in concurrent speech. A Recurrent Neural Network (RNN) generates the text transcript, which is then refined by an external language model and a spelling correction model. The proposed system identified up to five speakers with varying degrees of confidence and generated a transcript for each of them, with higher quality than other methods across several evaluation metrics. The results show that the proposed method performs better than the baseline method, even in the single-speaker case, and remains functional under simultaneous speech, with an average Word Error Rate (WER) of 16.59% for two speakers, 26.72% for three speakers, and 31.50% for four speakers.
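
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming simple whitespace tokenization. The function name and the example sentences are illustrative assumptions and are not taken from the paper's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped.
example_wer = word_error_rate("saya pergi ke pasar", "saya pergi pasar")
```

In a multi-speaker setting, a score like this would be computed per separated speaker stream and then averaged; how the paper pairs streams with reference transcripts is not stated in the abstract.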
