On Neural Phone Recognition of Mixed-Source ECoG Signals

The emerging field of neural speech recognition (NSR) from electrocorticography (ECoG) signals has recently attracted considerable research interest as a way to study how the human brain recognizes speech in quiet and noisy environments. In this study, we demonstrate the utility of NSR systems for objectively verifying the human ability to attend to a single speech source while suppressing interfering signals in a simulated cocktail-party scenario. The experimental results show that the relative performance degradation of the NSR system when tested on mixed-source signals is significantly lower than that of an automatic speech recognition (ASR) system. In this paper, we substantially improve the performance of our recently published framework by initializing training with manual alignments instead of the flat-start technique. We further improve NSR performance by accounting for possible transcription mismatches between the acoustic and neural signals.
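To illustrate why manual-alignment initialization helps, below is a minimal sketch (not the authors' pipeline; the function names and toy data are hypothetical). Flat start divides each utterance uniformly among its phones, so the initial per-phone emission models are fit from frames that often belong to neighboring phones; initializing from hand-labelled boundaries fits the same models from the correct frames, giving later alignment/EM passes a better starting point.

```python
import numpy as np

def flat_start_segments(n_frames, phones):
    """Flat start: split the utterance uniformly among its phones,
    ignoring where the true phone boundaries fall."""
    bounds = np.linspace(0, n_frames, len(phones) + 1).astype(int)
    return [(p, s, e) for p, s, e in zip(phones, bounds[:-1], bounds[1:])]

def init_phone_models(features, segments):
    """Fit one diagonal Gaussian per phone from a segmentation.
    This stands in for the first pass of HMM/GMM training: better
    initial boundaries yield better initial emission models."""
    pooled = {}
    for phone, start, end in segments:
        pooled.setdefault(phone, []).append(features[start:end])
    return {
        p: (np.concatenate(fs).mean(axis=0),
            np.concatenate(fs).var(axis=0) + 1e-6)
        for p, fs in pooled.items()
    }

# Toy utterance: 30 feature frames, 3 phones, true boundaries at
# frames 5 and 22 (far from the 10/20 split that flat start assumes).
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(0.0, 1.0, (5, 13)),    # phone "s"
    rng.normal(3.0, 1.0, (17, 13)),   # phone "ah"
    rng.normal(-2.0, 1.0, (8, 13)),   # phone "n"
])
phones = ["s", "ah", "n"]
manual = [("s", 0, 5), ("ah", 5, 22), ("n", 22, 30)]

flat_models = init_phone_models(features, flat_start_segments(30, phones))
manual_models = init_phone_models(features, manual)

# The manually initialized means sit closer to the true per-phone
# means (0, 3, -2), so training starts from a better operating point.
for p in phones:
    print(p, "flat:", flat_models[p][0][0], "manual:", manual_models[p][0][0])
```

In an actual HMM/GMM recipe these initial models would seed iterative realignment and re-estimation; the sketch only shows the first step, where the quality of the seed segmentation does its work.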
