Multi-view Dimensionality Reduction for Dialect Identification of Arabic Broadcast Speech

In this work, we present a new Vector Space Model (VSM) of speech utterances for the task of spoken dialect identification. Generally, DID systems are built using two sets of features that are extracted from speech utterances; acoustic and phonetic. The acoustic and phonetic features are used to form vector representations of speech utterances in an attempt to encode information about the spoken dialects. The Phonotactic and Acoustic VSMs, thus formed, are used for the task of DID. The aim of this paper is to construct a single VSM that encodes information about spoken dialects from both the Phonotactic and Acoustic VSMs. Given the two views of the data, we make use of a well known multi-view dimensionality reduction technique known as Canonical Correlation Analysis (CCA), to form a single vector representation for each speech utterance that encodes dialect specific discriminative information from both the phonetic and acoustic representations. We refer to this approach as feature space combination approach and show that our CCA based feature vector representation performs better on the Arabic DID task than the phonetic and acoustic feature representations used alone. We also present the feature space combination approach as a viable alternative to the model based combination approach, where two DID systems are built using the two VSMs (Phonotactic and Acoustic) and the final prediction score is the output score combination from the two systems.

[1]  Patrick Kenny,et al.  A Study of Interspeaker Variability in Speaker Verification , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Daoqiang Zhang,et al.  Multi-view dimensionality reduction via canonical random correlation analysis , 2015, Frontiers of Computer Science.

[3]  Douglas A. Reynolds,et al.  Language Recognition via i-vectors and Dimensionality Reduction , 2011, INTERSPEECH.

[4]  Daniel P. W. Ellis,et al.  Dialect and Accent Recognition Using Phonetic-Segmentation Supervectors , 2011, INTERSPEECH.

[5]  Bin Ma,et al.  A Vector Space Modeling Approach to Spoken Language Identification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Bianca Zadrozny,et al.  Transforming classifier scores into accurate multiclass probability estimates , 2002, KDD.

[7]  Marc A. Zissman,et al.  Automatic language identification , 2001, Speech Commun..

[8]  Samantha Wray,et al.  Crowdsource a little to label a lot: labeling a speech corpus of dialectal Arabic , 2015, INTERSPEECH.

[9]  Saleem Zaroubi,et al.  2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) , 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  James R. Glass,et al.  A complete KALDI recipe for building Arabic speech recognition systems , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[11]  B. Moore Principal component analysis in linear systems: Controllability, observability, and model reduction , 1981 .

[12]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[13]  Mike Rosner,et al.  Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages , 1998 .

[14]  James R. Glass,et al.  Automatic Dialect Detection in Arabic Broadcast Speech , 2015, INTERSPEECH.

[15]  Yu Zhang,et al.  Speaker adaptation using the i-vector technique for bottleneck features , 2015, INTERSPEECH.

[16]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[17]  Douglas A. Reynolds,et al.  Approaches to language identification using Gaussian mixture models and shifted delta cepstral features , 2002, INTERSPEECH.

[18]  Lukás Burget,et al.  iVector-based prosodic system for language identification , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  D. Wilks Canonical Correlation Analysis (CCA) , 2011 .

[20]  Dean P. Foster Multi-View Dimensionality Reduction via Canonical Correlation Multi-View Dimensionality Reduction via Canonical Correlation Analysis Analysis Multi-View Dimensionality Reduction via Canonical Correlation Analysis Multi-View Dimensionality Reduction via Canonical Correlation Analysis Multi-View Dimen , 2008 .

[21]  Nizar Habash,et al.  Spoken Arabic Dialect Identification Using Phonotactic Modeling , 2009, SEMITIC@EACL.

[22]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[23]  Julia Hirschberg,et al.  Automatic Dialect and Accent Recognition and its Application to Speech Recognition , 2011 .

[24]  Dean P. Foster,et al.  Eigenwords: spectral word embeddings , 2015, J. Mach. Learn. Res..