Phonetic differences for dialect clustering

In this paper, we investigate differences and similarities between dialects using unsupervised learning. We use a binary phonetic representation to cluster utterances from different Arabic and English dialects. The representation aims to capture phonetic patterns such as vowel and consonant length. We test it on an Arabic dataset containing utterances from speakers of four dialects: Egyptian, Gulf, Levantine, and North African. We then validate the approach on an English dataset containing utterances from speakers from Bradford, Cardiff, Dublin, and Liverpool.
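The abstract does not spell out the encoding, so the following is only a minimal sketch of how a binary phonetic representation could feed an unsupervised clustering step. It assumes, purely for illustration, that each segment of a romanized transcription is coded 1 for a vowel and 0 for a consonant, that runs of identical bits stand in for vowel and consonant length, and that utterances are grouped by average-linkage agglomerative clustering. The vowel inventory, the helper names (`binary_encode`, `run_length_profile`, `agglomerate`), and the toy words are all hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Assumption: a toy romanized vowel inventory, not the paper's phone set.
VOWELS = set("aeiou")

def binary_encode(transcription):
    """Map each letter to 1 (vowel) or 0 (consonant); skip non-letters."""
    return [1 if ch in VOWELS else 0
            for ch in transcription.lower() if ch.isalpha()]

def run_length_profile(bits, max_run=3):
    """Relative frequency of runs, keyed by (bit, run length capped at max_run)."""
    profile = {(b, n): 0 for b in (0, 1) for n in range(1, max_run + 1)}
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        profile[(bits[i], min(j - i, max_run))] += 1
        i = j
    total = sum(profile.values()) or 1  # avoid division by zero on empty input
    return [profile[key] / total for key in sorted(profile)]

def distance(p, q):
    """Manhattan distance between two run-length profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def agglomerate(profiles, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(profiles))]

    def link(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(distance(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return clusters

# Invented toy "utterances"; real input would be phonetic transcriptions.
utterances = ["kitaab", "kataab", "strengths", "scripts"]
profiles = [run_length_profile(binary_encode(u)) for u in utterances]
clusters = agglomerate(profiles, k=2)
print(clusters)
```

On real data, the profiles would be computed per utterance and the resulting clusters compared against the known dialect labels; the choice of run-length features and of Manhattan distance here is one simple option among many.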
