Accurate marginalization range for missing data recognition

Abstract

Missing data recognition has been proposed to increase the noise robustness of automatic speech recognition. This strategy relies on the use of a spectrographic mask that gives information about the true clean speech energy of a corrupted signal. This information is then used to refine the data processing during the decoding step. We propose in this work a new mask that provides more information about the clean speech contribution than classical masks based on Signal-to-Noise Ratio (SNR) thresholding. The proposed mask is described and compared to another missing data approach based on SNR thresholding. Experimental results show a significant word error rate reduction induced by the proposed approach. Moreover, the proposed mask outperforms the ETSI advanced front-end on the HIWIRE corpus.

Index Terms: robust speech recognition, missing data, bounded marginalization

1. Introduction

The presence of background noise typically causes mismatches between training and testing conditions, which significantly degrade the performance of automatic speech recognizers (ASRs). Over the last decades, many solutions to reduce the effect of noise have been proposed. Acoustic models can be adapted to new noisy conditions, the analysis front-end can be made robust to noise, and noise reduction algorithms can be used as preprocessing stages.

Although many of these methods have shown superior performance in noisy conditions compared to standard speech recognition, noise robustness is still a challenging issue for current speech recognizers, especially for non-stationary noise.

More recently, speech recognition with missing data has been proposed. This technique relies on a clustering of spectral features into two classes: time-frequency (T-F) units of a noisy speech signal that contain more speech energy than noise energy are classified as reliable data, while T-F units containing more noise energy are classified as missing data. Hence, the resulting clustering produces a binary mask that is exploited in missing data recognition techniques [1].
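
The reliable/missing classification described above can be illustrated with a minimal sketch. It assumes that the speech and noise energies of each T-F unit are available separately, which in practice holds only for an oracle mask or after a noise estimation step. The function name `binary_mask`, the `threshold_db` parameter, and the toy energy values are hypothetical and introduced only for illustration. Units with exactly equal speech and noise energy are treated as missing here, which is also an assumption.

```python
def binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """Classify each T-F unit as reliable (1) or missing (0).

    speech_energy, noise_energy: lists of frames, each frame a list of
    non-negative energies per frequency band (hypothetical inputs).
    threshold_db: SNR threshold in dB; 0 dB means "more speech energy
    than noise energy", as in the classification described in the text.
    """
    # Convert the dB threshold to a linear energy ratio.
    ratio = 10.0 ** (threshold_db / 10.0)
    mask = []
    for speech_frame, noise_frame in zip(speech_energy, noise_energy):
        # A unit is reliable when its SNR strictly exceeds the threshold.
        mask.append([1 if s > n * ratio else 0
                     for s, n in zip(speech_frame, noise_frame)])
    return mask


if __name__ == "__main__":
    # Toy example: 2 frames x 3 frequency bands (made-up values).
    speech = [[4.0, 1.0, 0.5],
              [0.2, 3.0, 2.0]]
    noise = [[1.0, 2.0, 0.5],
             [1.0, 1.0, 4.0]]
    print(binary_mask(speech, noise))  # [[1, 0, 0], [0, 1, 0]]
```

Comparing energies in the linear domain avoids taking a logarithm of zero when a unit contains no noise or no speech energy.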