Language and Gender Classification of Speech Files Using Supervised Machine Learning Methods

ABSTRACT Many language identification (LID) systems are based on language models that use techniques accounting for the fluctuation of speech over time. Accounting for these fluctuations requires longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of the spoken language. The feature extraction process is based on frames of 20 ms, whereas most previous LID systems presented results based on much longer frames (3 s or longer). We defined and implemented 200 features divided into four feature sets: cepstrum features, RASTA features, spectrum features, and waveform features. We applied eight machine learning (ML) methods to the features extracted from a corpus of speech files in 10 languages from the Oregon Graduate Institute (OGI) telephone speech database and compared their performance in an extensive experimental evaluation. The best optimized classification results were achieved by random forest (RF): from 76.29% on 10 languages to 89.18% on 2 languages. These results are better than or comparable to the state-of-the-art results for the OGI database. A second set of experiments addressed gender classification on 2 to 10 languages. For the RF method, the accuracy and F-measure values were greater than or equal to 90.05% for all the language experiments.
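
To make the notion of 20 ms frame-based feature extraction concrete, the following is a minimal sketch, not the feature set used in this work. It assumes an 8 kHz sampling rate (typical for telephone speech), non-overlapping frames, and two generic waveform descriptors (short-time energy and zero-crossing rate) chosen purely for illustration. All function names are hypothetical.

```python
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Split a signal into consecutive, non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def frame_features(samples, sample_rate_hz, frame_ms=20):
    """Return one (energy, zero-crossing rate) pair per frame."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [(short_time_energy(f), zero_crossing_rate(f)) for f in frames]


# Illustrative input: one second of a 440 Hz sine at the assumed 8 kHz rate.
sample_rate_hz = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate_hz) for n in range(sample_rate_hz)]
features = frame_features(signal, sample_rate_hz)
```

Under these assumptions each frame holds 160 samples, so one second of audio yields 50 feature vectors. In a pipeline of the kind described above, such per-frame vectors would be the input rows to a supervised classifier such as random forest.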
