A Near Real-Time Automatic Speaker Recognition Architecture for Voice-Based User Interface

In this paper, we present a novel pipelined, near real-time speaker recognition architecture that improves recognition performance through hybrid feature extraction, combining Gabor Filter (GF) features, Convolutional Neural Network (CNN) features, and statistical parameters into a single feature matrix. The architecture was developed to secure access to a voice-based user interface (UI) by providing speaker-based authentication and integrating with an existing Natural Language Processing (NLP) system. We first identify the challenges of real-time speaker recognition and highlight recent research in the field. We then analyze the functional requirements of a speaker recognition system and introduce the mechanisms through which our architecture addresses them. Subsequently, we discuss the effect of the CNN, GF, and statistical-parameter techniques on feature extraction. For classification, we investigate standard classifiers: Support Vector Machine (SVM), Random Forest (RF), and Deep Neural Network (DNN). To verify the validity and effectiveness of the proposed architecture, we compare its accuracy, sensitivity, and specificity with those of the standard AlexNet architecture.
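
As an illustration of the evaluation metrics named above, the following minimal sketch computes accuracy, sensitivity, and specificity from predicted and true labels. It assumes a binary accept/reject framing (1 = target speaker, 0 = impostor), which is a simplification made here and not necessarily the evaluation protocol of the paper; the function name `binary_metrics` and the sample labels are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Assumption (not from the paper): 1 marks the target speaker, 0 an impostor.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # target accepted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # impostor rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # impostor accepted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # target rejected

    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For an authentication use case, specificity is the metric most directly tied to security, since a false positive corresponds to an impostor being accepted, while sensitivity reflects how often a legitimate speaker is recognized.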
