Traditional Machine Learning for Pitch Detection

Pitch detection is a fundamental problem in speech processing because the fundamental frequency (F0) is used in a large number of applications. Recent papers have proposed deep learning for robust pitch tracking. In this letter, we treat voicing detection as a classification problem and F0 contour estimation as a regression problem. For both tasks, we use acoustic features from multiple domains together with traditional machine learning methods. The discriminative power of existing and proposed features is assessed through mutual information, and multiple supervised and unsupervised approaches are compared. A significant relative reduction in voicing errors over the best baseline is obtained: 20% with the best clustering method (K-means) and 45% with a multi-layer perceptron. For F0 contour estimation, however, the benefits of regression techniques are limited. We also investigate whether these objective gains translate to a parametric synthesis task. Clear perceptual preferences are observed for the proposed approach over two widely used baselines: the robust algorithm for pitch tracking (RAPT) and distributed inline-filter operation (DIO).
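To make the feature-assessment and clustering steps concrete, the following is a minimal sketch rather than the letter's actual pipeline. It assumes a single synthetic per-frame feature (a hypothetical periodicity score, not one of the letter's features), estimates its mutual information with the voiced/unvoiced label using a simple histogram, and then makes an unsupervised voicing decision with a two-cluster K-means. The function names, bin count, and data distribution are all illustrative assumptions.

```python
import numpy as np


def mutual_information(feature, labels, n_bins=20):
    """Histogram (plug-in) estimate of I(feature; label) in bits, binary label."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    # Interior edges map every value to a bin index in [0, n_bins - 1].
    idx = np.digitize(feature, edges[1:-1])
    joint = np.zeros((n_bins, 2))
    np.add.at(joint, (idx, labels), 1.0)   # count (bin, label) co-occurrences
    joint /= joint.sum()                   # joint probability table
    px = joint.sum(axis=1, keepdims=True)  # marginal over feature bins
    py = joint.sum(axis=0, keepdims=True)  # marginal over labels
    nz = joint > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))


def kmeans_voicing(feature, n_iter=50):
    """Two-cluster 1-D K-means; the higher-centroid cluster is labelled voiced."""
    centroids = np.array([feature.min(), feature.max()], dtype=float)
    for _ in range(n_iter):
        assign = np.abs(feature[:, None] - centroids[None, :]).argmin(axis=1)
        updated = np.array([
            feature[assign == k].mean() if np.any(assign == k) else centroids[k]
            for k in range(2)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the final centroids.
    assign = np.abs(feature[:, None] - centroids[None, :]).argmin(axis=1)
    return assign == np.argmax(centroids)


# Synthetic frames (assumption): voiced frames score high, unvoiced frames low.
rng = np.random.default_rng(0)
n_frames = 2000
voiced = rng.random(n_frames) < 0.5
periodicity = np.where(voiced,
                       rng.normal(0.8, 0.1, n_frames),
                       rng.normal(0.2, 0.1, n_frames))
noise_feature = rng.normal(0.0, 1.0, n_frames)  # carries no voicing information

labels = voiced.astype(int)
mi_periodicity = mutual_information(periodicity, labels)
mi_noise = mutual_information(noise_feature, labels)

predicted_voiced = kmeans_voicing(periodicity)
voicing_error = float(np.mean(predicted_voiced != voiced))

print(f"MI(periodicity; voicing) = {mi_periodicity:.3f} bits")
print(f"MI(noise; voicing)       = {mi_noise:.3f} bits")
print(f"K-means voicing error    = {voicing_error:.3%}")
```

In this toy setting the informative feature should score far higher than the noise feature, which is the kind of ranking a mutual-information analysis is meant to expose. Real features overlap much more than these two well-separated Gaussians, so the error rate here says nothing about the figures reported in the letter.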
