Enabling Real-Time On-Chip Audio Super Resolution for Bone-Conduction Microphones

Voice communication through air-conduction microphones degrades in noisy environments because ambient noise reduces speech audibility. Bone-conduction microphones (BCMs) are robust to ambient noise but have limited effective bandwidth because of their sensing mechanism. Existing audio super resolution algorithms can recover the lost high-frequency content and achieve high-fidelity audio, but they require considerably more computational resources than low-power hearable devices provide. This paper proposes the first real-time on-chip speech audio super resolution system for BCMs. To accomplish this, we built and compared a series of lightweight deep learning models for audio super resolution. Among these models, ATS-UNet is the most cost-efficient, because the proposed Audio Temporal Shift Module (ATSM) reduces the network’s dimensionality while retaining sufficient temporal features from speech audio. We then quantized ATS-UNet and deployed it on low-end ARM microcontroller units as a real-time embedded prototype. Evaluation results show that our system achieves real-time inference speed on a Cortex-M7 and higher quality than the baseline audio super resolution method. Finally, we conducted a user study with ten expert and ten amateur listeners to evaluate how effective our method is to human ears. Both groups perceived significantly higher speech quality with our method than with the original BCM or with an air-conduction microphone processed by cutting-edge noise reduction algorithms.
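The abstract does not specify how ATSM works internally. As a rough illustration of the general temporal-shift idea that its name suggests, the sketch below shifts a fraction of feature channels by one frame along the time axis so that a per-frame layer can see neighbouring frames without a wider kernel. The function name `temporal_shift`, the `(frames, channels)` layout, the `fold_div` parameter, and the `causal` flag are all assumptions made for this example, not details of the proposed module.

```python
import numpy as np

def temporal_shift(x: np.ndarray, fold_div: int = 4, causal: bool = True) -> np.ndarray:
    """Shift a fraction of channels by one frame along the time axis.

    x has shape (frames, channels). One group of channels//fold_div channels
    is delayed by one frame. When causal is False, a second group of the same
    size is advanced by one frame. Vacated positions are zero-filled.
    Hypothetical illustration only; not the ATSM from the paper.
    """
    frames, channels = x.shape
    fold = channels // fold_div
    out = np.zeros_like(x)

    # Group 1: frame t receives the values of frame t-1 (looks into the past).
    out[1:, :fold] = x[:-1, :fold]

    if causal:
        # Streaming-friendly variant: all remaining channels stay in place.
        out[:, fold:] = x[:, fold:]
    else:
        # Group 2: frame t receives the values of frame t+1 (needs lookahead).
        out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
        # Remaining channels stay in place.
        out[:, 2 * fold:] = x[:, 2 * fold:]
    return out

if __name__ == "__main__":
    # Toy feature map: 3 frames, 4 channels.
    features = np.arange(12, dtype=float).reshape(3, 4)
    print(temporal_shift(features, fold_div=4, causal=True))
```

The shift itself adds no multiplications or parameters, which is why this family of operations is attractive on microcontrollers. The causal variant avoids lookahead, which a streaming system would likely need, although whether the paper uses a causal shift is not stated here.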
