Target speech feature extraction using non-parametric correlation coefficient

Speech recognition systems in automobiles suffer from recognition failures caused by environmental noise from inside and outside the vehicle and by interfering voices. This paper therefore presents a technique for extracting only a selected target voice from an input signal in which several voices and noise sources are mixed. The proposed selective speech feature extraction builds a correlation map of auditory elements from inter-channel similarity and temporal continuity, and extracts speech features using a non-parametric correlation coefficient. The method was validated by showing that it reduced the average separation distortion by 0.8630 dB. Selective feature extraction using a cross-correlation performs well, but selective feature extraction using a non-parametric correlation performs better overall.
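The abstract does not specify which non-parametric coefficient is used; a common choice is Spearman's rank correlation. The following is a minimal sketch, assuming Spearman's rho, of how a channel-by-channel correlation map might be built from per-channel envelope sequences (the function names and structure here are illustrative, not the paper's actual implementation):

```python
def rankdata(x):
    """Assign ranks to the values of x, averaging ranks over ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(x):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def correlation_map(channels):
    """Pairwise Spearman correlations between auditory channel envelopes."""
    n = len(channels)
    return [[spearman(channels[i], channels[j]) for j in range(n)]
            for i in range(n)]
```

Because Spearman's rho depends only on the rank ordering of the envelope values, it is robust to the monotone nonlinear level differences that arise between auditory channels, which is one reason a non-parametric coefficient can outperform a plain cross-correlation here.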
