Phone Recognition for Mixed Speech Signals: Comparison of Human Auditory Cortex and Machine Performance

It is well known that human listeners can often attend to a single sound source within a mixture of several sources, and that automatic speech recognition (ASR) without effective blind source separation performs poorly at this task. Here we analyze human cortical signals to demonstrate their relative robustness to mixed speech, in contrast to a deep neural network-based ASR system. Confirming this difference with a carefully designed experiment is a first step towards improving blind source separation for speech recognition. In particular, the design of features extracted from the neural signals is leading to insights about the corresponding feature extraction on the acoustic side, e.g., for future computational auditory scene analysis (CASA) systems.
