Robot with two ears listens to more than two simultaneous utterances by exploiting harmonic structures

In real-world situations, people often hear more than two sounds at the same time. When the number of sound sources exceeds the number of sensors, the separation problem is called under-determined, and a robot with two ears must be able to handle it. Some studies on under-determined sound source separation use L1-norm minimization, but automatic speech recognition (ASR) performs poorly on the separated speech because these methods introduce spectral distortion. In this paper, we present a two-stage separation method that improves separation quality at low computational cost. The first stage uses L1-norm minimization to extract harmonic structures. The second stage exploits the reliable harmonic structures to preserve acoustic features. Experiments simulating three simultaneous utterances recorded by two microphones in an anechoic chamber show that our method improves speech recognition correctness by about three points and is fast enough for real-time separation.
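The first stage relies on L1-norm minimization: in each time-frequency bin, the two microphone observations x are modeled as x = A s, where A is a 2 x N mixing matrix and s holds the N source values, and the separated values are taken to be the s with the smallest L1 norm that satisfies this equation. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation. It assumes real-valued data and a known mixing matrix (for example, one derived from source directions); the names `l1_min_separate` and `solve_2x2` are illustrative, and complex-valued STFT coefficients would need a different treatment.

```python
from itertools import combinations


def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] @ y = [b1, b2] by Cramer's rule.

    Returns None when the 2x2 matrix is (numerically) singular.
    """
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        return None
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


def l1_min_separate(A, x):
    """Hypothetical sketch: minimum-L1-norm solution s of A s = x.

    Assumes A is a real 2 x N mixing matrix (two microphones, N sources)
    and x is the real 2-element observation for one time-frequency bin.
    For real values this is a linear program whose optimum is attained at
    a basic solution with at most two nonzero sources, so trying every
    pair of independent columns and keeping the smallest L1 norm is exact.
    """
    n = len(A[0])
    best, best_norm = None, float("inf")
    for i, j in combinations(range(n), 2):
        sol = solve_2x2(A[0][i], A[0][j], A[1][i], A[1][j], x[0], x[1])
        if sol is None:
            continue  # columns i and j are collinear; skip this pair
        norm = abs(sol[0]) + abs(sol[1])
        if norm < best_norm:
            s = [0.0] * n
            s[i], s[j] = sol
            best, best_norm = s, norm
    return best


if __name__ == "__main__":
    # Illustrative mixing matrix: three sources, unit-norm columns.
    A = [[1.0, 0.6, 0.0],
         [0.0, 0.8, 1.0]]
    # Only the second source is active in this bin.
    print(l1_min_separate(A, [1.2, 1.6]))
```

Because the solution never has more than two nonzero entries per bin, a bin in which all three speakers are active cannot be represented exactly. This gives one intuition for why spectra separated by L1-norm minimization alone can be distorted.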
