Paper information

A Psychoacoustic Engineering Approach to Machine Sound Source Separation in Reverberant Environments

Reverberation continues to present a major problem for sound source separation algorithms, due to its corruption of many of the acoustical cues on which these algorithms rely. However, humans demonstrate a remarkable robustness to reverberation and many psychophysical and perceptual mechanisms are well documented. This thesis therefore considers the research question: can the reverberation–performance of existing psychoacoustic engineering approaches to machine source separation be improved?

The precedence effect is a perceptual mechanism that aids our ability to localise sounds in reverberant environments. Despite this, relatively little work has been done on incorporating the precedence effect into automated sound source separation. Consequently, a study was conducted that compared several computational precedence models and their impact on the performance of a baseline separation algorithm. The algorithm included a precedence model, which was replaced with the other precedence models during the investigation. The models were tested using a novel metric in a range of reverberant rooms and with a range of other mixture parameters. The metric, termed Ideal Binary Mask Ratio, is shown to be robust to the effects of reverberation and facilitates meaningful and direct comparison between algorithms across different acoustic conditions.

Large differences between the performances of the models were observed. The results showed that a separation algorithm incorporating a model based on interaural coherence produces the greatest performance gain over the baseline algorithm. The results from the study also indicated that it may be necessary to adapt the precedence model to the acoustic conditions in which the model is utilised. This effect is analogous to the perceptual Clifton effect, which is a dynamic component of the precedence effect that appears to adapt precedence to a given acoustic environment in order to maximise its effectiveness.

However, no work has been carried out on adapting a precedence model to the acoustic conditions under test. Specifically, although the necessity for such a component has been suggested in the literature, neither its necessity nor benefit has been formally validated. Consequently, a further study was conducted in which parameters of each of the previously compared precedence models were varied in each room in order to identify if, and to what extent, the separation performance varied with these parameters. The results showed that the reverberation–performance of existing psychoacoustic engineering approaches to machine source separation can be improved and can yield significant gains in separation performance.
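
The abstract names the Ideal Binary Mask Ratio but does not define it. As background only, the sketch below shows the commonly used notion of an ideal binary mask, on which such metrics build: a time-frequency unit is kept when the target energy exceeds the interferer energy by a local criterion. The function names, the `lc_db` parameter, the toy energy grids, and the `mask_agreement` helper are all assumptions made for illustration. In particular, `mask_agreement` is a simplified stand-in and is not the thesis's Ideal Binary Mask Ratio.

```python
import math


def ideal_binary_mask(target_energy, interferer_energy, lc_db=0.0):
    """Return 1 for each time-frequency unit whose target-to-interferer
    energy ratio exceeds lc_db (in dB), and 0 otherwise."""
    mask = []
    for t_row, i_row in zip(target_energy, interferer_energy):
        row = []
        for t, i in zip(t_row, i_row):
            if t <= 0.0:
                row.append(0)          # no target energy in this unit
            elif i <= 0.0:
                row.append(1)          # target present, interferer absent
            else:
                ratio_db = 10.0 * math.log10(t / i)
                row.append(1 if ratio_db > lc_db else 0)
        mask.append(row)
    return mask


def mask_agreement(estimated, ideal):
    """Fraction of units on which two binary masks agree.
    Hypothetical helper for illustration, not the thesis's metric."""
    total = 0
    same = 0
    for e_row, i_row in zip(estimated, ideal):
        for e, i in zip(e_row, i_row):
            total += 1
            same += int(e == i)
    return same / total if total else 0.0


# Toy energies: rows are frequency channels, columns are time frames.
target = [[4.0, 1.0, 0.0], [2.0, 8.0, 1.0]]
interferer = [[1.0, 2.0, 1.0], [2.0, 1.0, 0.0]]

ibm = ideal_binary_mask(target, interferer)
estimated = [[1, 0, 1], [0, 1, 1]]  # a made-up algorithm output
score = mask_agreement(estimated, ibm)
```

In this toy case the estimated mask differs from the ideal mask in one of six units. A separation algorithm's output mask would be compared against the ideal mask in a similar spirit, although the thesis's metric is designed specifically to remain meaningful across different reverberant conditions.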

Christopher Hummersone
