Paper information - Statistical models for natural sounds

Statistical models for natural sounds

Understanding the rich structure of natural sounds is important both for solving practical tasks, such as automatic speech recognition, and for understanding auditory processing in the brain. This thesis takes a step in this direction by characterising the statistics of simple natural sounds. We focus on the statistics because perception often appears to depend on them rather than on the raw waveform. For example, the perception of auditory textures, like running water, wind, fire and rain, depends on summary statistics, like the rate of falling rain droplets, rather than on the exact details of the physical source. In order to analyse the statistics of sounds accurately, it is necessary to improve a number of traditional signal processing methods, including those for amplitude demodulation, time-frequency analysis, and sub-band demodulation. These estimation tasks are ill-posed, and it is therefore natural to treat them as Bayesian inference problems. The new probabilistic versions of these methods have several advantages: they perform more accurately on natural signals, are more robust to noise, can fill in missing sections of data, and provide error bars. Furthermore, free parameters can be learned from the signal. Using these new algorithms, we demonstrate that the energy, sparsity, modulation depth and modulation time-scale in each sub-band of a signal are critical statistics, together with the dependencies between the sub-band modulators. To validate this claim, we show that a model containing co-modulated coloured noise carriers is capable of generating a range of realistic-sounding auditory textures. Finally, we explore the connection between the statistics of natural sounds and perception. We demonstrate that inference in the model for auditory textures qualitatively replicates the primitive grouping rules that listeners use to understand simple acoustic scenes. This suggests that the auditory system is optimised for the statistics of natural sounds.
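As a rough illustration of what "co-modulated coloured noise carriers" could look like in practice, the sketch below multiplies band-limited Gaussian noise carriers by slowly varying positive envelopes that share a common component. It is not the model from the thesis: the function names, band centres, bandwidths, the 10 Hz modulator cutoff, the softplus link and the `shared_weight` parameter are all assumptions chosen only for illustration.

```python
import numpy as np


def coloured_noise_carrier(n, fs, centre_hz, bandwidth_hz, rng):
    """Band-limited Gaussian noise: white noise masked in the frequency domain."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = np.abs(freqs - centre_hz) <= bandwidth_hz / 2.0
    carrier = np.fft.irfft(spectrum * mask, n=n)
    return carrier / carrier.std()


def slow_modulator(n, fs, cutoff_hz, rng):
    """Slowly varying Gaussian signal: white noise low-passed below cutoff_hz."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    slow = np.fft.irfft(spectrum * (freqs <= cutoff_hz), n=n)
    return slow / slow.std()


def comodulated_texture(n=16000, fs=16000, centres=(500, 1000, 2000, 4000),
                        shared_weight=0.8, seed=0):
    """Sum of sub-bands, each a noise carrier times a partly shared envelope.

    shared_weight (assumed parameter, between 0 and 1) sets how strongly the
    sub-band modulators depend on one common slow signal.
    """
    rng = np.random.default_rng(seed)
    shared = slow_modulator(n, fs, 10.0, rng)
    envelopes = []
    texture = np.zeros(n)
    for centre in centres:
        private = slow_modulator(n, fs, 10.0, rng)
        latent = shared_weight * shared + np.sqrt(1.0 - shared_weight ** 2) * private
        envelope = np.log1p(np.exp(latent))  # softplus keeps the envelope positive
        carrier = coloured_noise_carrier(n, fs, centre, centre / 4.0, rng)
        envelopes.append(envelope)
        texture += envelope * carrier
    # Scale to a peak amplitude of 1 so the result can be written to audio.
    return texture / np.max(np.abs(texture)), np.array(envelopes)


texture, envelopes = comodulated_texture()
print(texture.shape, envelopes.shape)
```

Setting `shared_weight` to 1 makes every sub-band follow the same envelope, while 0 makes the envelopes independent; intermediate values give partial dependence between the sub-band modulators, which is the kind of statistic the abstract identifies as important.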

Richard E. Turner
