Trends in audio signal feature extraction methods

Abstract Audio signal processing algorithms generally involve analysing a signal, extracting its properties, predicting its behaviour, recognizing whether any pattern is present in it, and determining how it is correlated with other similar signals. Audio signals include music, speech and environmental sounds. Over the last few decades, audio signal processing has grown significantly in terms of signal analysis and classification, and it has been shown that many existing problems can be solved by integrating modern machine learning (ML) algorithms with audio signal processing techniques. The performance of any ML algorithm depends on the features on which training and testing are done. Hence feature extraction is one of the most vital parts of a machine learning process. The aim of this study is to summarize the literature on audio signal processing, focusing especially on feature extraction techniques. In this survey, temporal domain, frequency domain, cepstral domain, wavelet domain and time-frequency domain features are discussed in detail.
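
As a rough illustration of the first two feature domains named in the abstract, the sketch below computes one temporal feature (zero-crossing rate) and one frequency-domain feature (spectral centroid) for a single frame. It is not taken from the survey: the function names, the 8 kHz sampling rate, the 64-sample frame and the synthetic 1 kHz test tone are all assumptions chosen for the example, and the naive DFT stands in for the FFT a practical system would use.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Temporal feature: fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Frequency-domain feature: magnitude-weighted mean frequency in Hz.

    Uses a naive O(n^2) DFT over the non-negative frequency bins.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        mags.append(abs(coeff))
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, return 0 by convention
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


# Hypothetical input: one frame of a 1 kHz sine sampled at 8 kHz.
# The pi/8 phase offset keeps samples away from exact zeros.
SAMPLE_RATE = 8000
frame = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE + math.pi / 8)
    for t in range(64)
]

zcr = zero_crossing_rate(frame)
centroid = spectral_centroid(frame, SAMPLE_RATE)
print(f"zero-crossing rate: {zcr:.3f}, spectral centroid: {centroid:.1f} Hz")
```

In a full pipeline such per-frame values would be computed over many overlapping frames and stacked into the feature vectors on which an ML model is trained and tested.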
