Phoneme awareness offers a path to high-resolution speech recognition that overcomes the difficulties of classical word recognition. Here we present the results of a preliminary study of Artificial Neural Network (ANN) and Hidden Markov Model (HMM) classification methods for human speech recognition through diphthong vowel sounds in the English phonetic alphabet, with a specific focus on evolutionary optimisation of bio-inspired classification methods. A set of audio clips was recorded by subjects from the United Kingdom and Mexico. Each recording was pre-processed using Mel-Frequency Cepstral Coefficients (MFCCs) with a sliding window of 200 ms per data object, and additionally converted to an MFCC time-series format for forecast-based models, to produce the dataset. We found that an evolutionarily optimised deep neural network achieves 90.77% phoneme classification accuracy, compared with 86.23% for the best HMM, which has 150 hidden units. Many of the evolutionary solutions take substantially longer to train than the HMM; however, one solution scoring 87.5% (+1.27 percentage points over the HMM) requires fewer resources than the HMM.
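As an illustration of the 200 ms sliding-window step, the following minimal sketch splits a recording into fixed-length windows, each of which would then be passed to an MFCC extractor (not shown). The function name `window_bounds`, the 16 kHz sample rate, and the hop size are assumptions for illustration only; the text above states the window length but not the hop or the sample rate.

```python
def window_bounds(n_samples, sample_rate, window_ms=200, hop_ms=200):
    """Return (start, end) sample indices for each full window.

    window_ms=200 matches the window length stated in the text.
    hop_ms is an assumption: 200 gives non-overlapping windows,
    a smaller value gives overlapping ones.
    """
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    # Only full windows are kept; a trailing partial window is dropped.
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, hop)]


def split_windows(signal, sample_rate, window_ms=200, hop_ms=200):
    """Slice a sequence of samples into windows (one data object each)."""
    bounds = window_bounds(len(signal), sample_rate, window_ms, hop_ms)
    return [signal[start:end] for start, end in bounds]


# Hypothetical input: one second of silence sampled at 16 kHz.
one_second = [0.0] * 16000
windows = split_windows(one_second, sample_rate=16000)
print(len(windows), len(windows[0]))
```

Under these assumptions, one second of audio yields five non-overlapping windows of 3200 samples each; halving the hop to 100 ms would produce overlapping windows instead.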