mmMIC: Multi-modal Speech Recognition based on mmWave Radar

Microphone-based speech recognition, now widespread through voice assistants, often performs poorly when multiple sound sources or ambient noise are present. In this paper, we propose a novel mmWave-based speech recognition solution that addresses both problems by precisely extracting multi-modal features of lip motion and vocal-cord vibration from a single mmWave channel. First, we propose a difference-based feature extraction method for lip motion that suppresses dynamic interference from body and head motion. Second, we propose a speech detection method based on cross-validation of lip motion and vocal-cord vibration, which avoids wasting computing resources on non-speaking activities. Third, we propose a multi-modal fusion framework that fuses the signal features of lip motion and vocal-cord vibration with an attention mechanism. We implemented a prototype system and evaluated its performance on real test-beds. Experimental results show an average speech recognition accuracy of 92.8% in realistic environments.
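
The abstract does not give the details of the difference-based extraction or the cross-validated speech detection, so the following is only a toy sketch of the general idea, not the authors' method. It assumes two already-separated 1-D streams (a lip-motion signal and a vocal-cord vibration signal), uses a simple first difference to cancel a slowly varying offset such as body or head drift, and flags a frame as speech only when both streams show activity. All function names, the frame length, and the thresholds are hypothetical.

```python
from typing import List, Sequence


def first_difference(samples: Sequence[float]) -> List[float]:
    """Differences between consecutive samples.

    A constant or slowly varying offset largely cancels, while fast
    changes are preserved. (Assumed stand-in for the paper's
    difference-based step, which the abstract does not specify.)
    """
    return [b - a for a, b in zip(samples, samples[1:])]


def frame_energy(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared value of each non-overlapping frame of frame_len samples."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def detect_speech(
    lip: Sequence[float],
    vocal: Sequence[float],
    frame_len: int,
    lip_thr: float,
    vocal_thr: float,
) -> List[bool]:
    """Flag a frame as speech only if BOTH modalities are active.

    Lip activity is measured on the differenced signal; vocal-cord
    activity is measured on the raw vibration signal. Thresholds are
    hypothetical and would need tuning on real data.
    """
    lip_e = frame_energy(first_difference(lip), frame_len)
    vocal_e = frame_energy(vocal, frame_len)
    # zip stops at the shorter list (differencing drops one sample).
    return [l > lip_thr and v > vocal_thr for l, v in zip(lip_e, vocal_e)]
```

Requiring agreement between the two streams means that lip movement without vocal-cord vibration (chewing, for example) or vibration without lip movement is not passed on to the recognizer, which is the resource-saving behaviour the abstract describes.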
