Effectiveness of Dynamic Features in INCA and Temporal Context-INCA

Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. A key step in a standalone non-parallel VC task is obtaining corresponding speech frames from the source and target speakers before learning the mapping function. Obtaining such corresponding pairs is challenging because the two speakers may have spoken different utterances, in the same or in different languages. The Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) algorithm and its variant, Temporal Context (TC)-INCA, are popular unsupervised alignment algorithms. Both INCA and TC-INCA iteratively learn the mapping function after obtaining the Nearest Neighbor (NN) aligned pairs from the intermediate converted and the target spectral features. In this paper, we propose to use dynamic features along with static features to compute the NN aligned pairs in both INCA and TC-INCA, since dynamic features are known to play a key role in differentiating major phonetic categories. We obtained an average relative improvement of 13.75 % and 5.39 % with our proposed Dynamic INCA and Dynamic TC-INCA, respectively. This improvement is also reflected in the quality of the converted voices.
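The idea of computing NN pairs on static plus dynamic features can be illustrated with a minimal sketch. It is not the authors' implementation: the regression-based delta with a window of two frames, the squared Euclidean distance, the single search direction, and the names `delta`, `add_dynamic`, and `nn_align` are all assumptions made for illustration.

```python
import numpy as np

def delta(feats, N=2):
    """Regression-based delta over +/-N frames (assumed formula).

    feats: array of shape (T, D), one static feature vector per frame.
    Edge frames are handled by repeating the first and last frame.
    """
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        # padded[N + n + t] is frame t + n, padded[N - n + t] is frame t - n
        out += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return out / denom

def add_dynamic(static):
    """Append delta features to the static features, frame by frame."""
    return np.hstack([static, delta(static)])

def nn_align(src, tgt):
    """For each source frame, return the index of the nearest target frame
    under squared Euclidean distance (one direction only)."""
    dists = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Hypothetical data: random vectors stand in for spectral features.
rng = np.random.default_rng(0)
converted_static = rng.normal(size=(50, 4))  # intermediate converted source frames
target_static = rng.normal(size=(60, 4))     # target speaker frames

pairs = nn_align(add_dynamic(converted_static), add_dynamic(target_static))
```

In an INCA-style loop, `pairs` would be used to re-estimate the mapping function, the source features would be converted again, and the search repeated. That outer loop is omitted here.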
