Robust acoustic object detection.

We consider a novel approach to the problem of detecting phonological objects, such as phonemes, syllables, or words, directly from the speech signal. We begin by defining local features in the time-frequency plane with built-in robustness to intensity variations and time warping. Global templates of phonological objects correspond to the coincidence in time and frequency of patterns of the local features. These templates are constructed in a principled way from the statistics of the local features. The templates are phonetically interpretable, are easily adaptable, have built-in invariances, and display considerable robustness to additive noise and clutter from competing speakers. We provide a detailed evaluation of several diphone detectors and a word detector based on this approach. We also report phonetic classification experiments that use the edge-based features proposed here.
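
As a rough illustration of why an edge-based local feature can be insensitive to overall intensity, the sketch below marks binary "onset edges" in a log-magnitude spectrogram and scores a template by how many of its expected edges coincide with detected ones. This is only a minimal sketch under stated assumptions, not the feature definition or template construction used in the paper: the function names `onset_edges` and `template_score`, the threshold value, and the toy data are all hypothetical.

```python
import numpy as np

def onset_edges(log_spec, threshold=1.5):
    """Binary map of temporal onsets (hypothetical feature, for illustration).

    An entry is 1 where the log-energy in a frequency band rises by more
    than `threshold` between consecutive frames. A uniform gain change adds
    a constant to the log spectrogram, which cancels in the difference.
    """
    diff = np.diff(log_spec, axis=1)  # shape: (n_freq, n_frames - 1)
    return (diff > threshold).astype(np.uint8)

def template_score(edges, template):
    """Fraction of the template's expected edge positions that are present."""
    return float((edges & template).sum()) / float(template.sum())

# Toy log spectrogram: 3 frequency bands x 4 frames (assumed values).
log_spec = np.array([[0.0, 0.0, 4.0, 4.0],
                     [1.0, 1.0, 1.0, 5.0],
                     [2.0, 2.0, 2.0, 2.0]])

edges = onset_edges(log_spec)
louder_edges = onset_edges(log_spec + 2.0)  # same signal at higher gain

# Hypothetical template: an onset in band 0, then one frame later in band 1.
template = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]], dtype=np.uint8)
score = template_score(edges, template)
```

In this toy case the edge map is identical before and after the gain change, because the feature depends only on differences of log-energy. The score counts coincidences of edges in time and frequency, which is the sense in which a global template can be built from local features.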
