A Transfer Learning Approach for the 2018 FEMH Voice Data Challenge

The human voice can be significantly affected by neoplasm, vocal palsy, and phonotrauma. Computer-aided diagnosis based on voice analysis can offer patients around the world a remote and cost-effective tool. In this paper, we propose a deep transfer learning approach to differentiate pathological voice samples from normal ones. We develop the model using voice samples recorded from 200 patients at the Far Eastern Memorial Hospital (FEMH). From each voice sample we extract prosodic, vocal tract, and excitation features as new representations for diagnosis. To address the small size of the data set, we use the TIMIT data set in a transfer learning scheme: a deep belief network (DBN) is first trained on the TIMIT data, and the trained model is then applied to the FEMH data set as a feature extractor. Finally, we train a support vector machine (SVM) classifier on the extracted features for diagnosis. We evaluate our approach using leave-one-out cross-validation (LOOCV) on the 200 training patients, and achieve 94.90% sensitivity with 59.77% unweighted average recall (UAR) on the 400 FEMH testing patients. Our results suggest that the proposed method may be used effectively for pathological voice detection.
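The two reported figures measure different things, and a small worked example may help in reading them side by side. The sketch below assumes the usual definitions: sensitivity is the fraction of truly pathological samples flagged as pathological (of any kind), and UAR is the plain mean of per-class recalls. The class labels, the toy predictions, and the function names are hypothetical and chosen only for illustration; they are not the FEMH data or the authors' evaluation code.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall of each class that appears in the ground truth."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of per-class recalls, ignoring class sizes."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def sensitivity(y_true, y_pred, normal_label="normal"):
    """Fraction of pathological samples predicted as any pathological class."""
    pathological = [(t, p) for t, p in zip(y_true, y_pred) if t != normal_label]
    detected = sum(1 for _, p in pathological if p != normal_label)
    return detected / len(pathological)


# Hypothetical toy labels (assumption: one normal class and two disease classes).
y_true = ["normal", "normal",
          "neoplasm", "neoplasm", "neoplasm", "neoplasm",
          "palsy", "palsy"]
y_pred = ["normal", "neoplasm",
          "neoplasm", "neoplasm", "neoplasm", "palsy",
          "palsy", "neoplasm"]

uar = unweighted_average_recall(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
print(f"sensitivity = {sens:.4f}, UAR = {uar:.4f}")
```

In this toy case every pathological sample is flagged as pathological, so sensitivity is 1.0, while confusions between classes pull the UAR down to 7/12. This shows only that the two metrics can diverge in principle; it makes no claim about the cause of the gap in the reported results.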
