AixOx, a multi-layered learners' corpus: automatic annotation

This paper presents a multilingual learners' corpus, AixOx, collected in the framework of an Alliance project (a partnership between the British Council and the French Ministry of Foreign Affairs). The corpus consists of recordings of 40 one-minute passages in English and French from the Eurom 1 corpus (Chan et al., 1995), read by native speakers and L2 learners. French native speakers reading the French and English passages were recorded in Aix-en-Provence, and English native speakers reading the English and French passages were recorded in Oxford. The AixOx corpus contains about 40 hours of read speech and can be downloaded from the “Speech and Language Data Repository” (http://sldr.org).

This paper also presents the tools used for automatic annotation on several layers:
• SPPAS (SPeech Phonetization Alignment and Syllabification; Bigi, 2012) for segmentation into utterances, words, syllables and phonemes;
• MoMel (Modelling Melody) and INTSINT (INternational Transcription System for INTonation) (Hirst, 2007) for the modelling and coding of intonation.

Finally, an example of a pedagogical application of the corpus is given: a pilot study on the intonation of questions. We show how the AixOx corpus can be used to compare the productions of native speakers with those of learners and how the annotation makes it possible to understand and explain the learners' prosodic realisations, whether positive or negative. We conclude that AixOx, with its multi-layered annotation, is a very rich oral database for all kinds of studies on L1 productions, L2 productions and language contact, at both the segmental and supra-segmental levels, since it offers phonemic segmentation and alignment as well as prosodic labelling.

[1]  Daniel Hirst,et al.  SPeech Phonetization Alignment and Syllabification (SPPAS): a tool for the automatic analysis of speech prosody , 2012 .

[2]  R. Espesser,et al.  Le CID - Corpus of Interactional Data. Annotation et exploitation multimodale de parole conversationnelle [The “Corpus of Interactional Data” (CID) - Multimodal annotation of conversational speech] , 2008, ICON.

[3]  Peter Roach,et al.  Daniel Jones' English pronouncing dictionary. 17th edition. , 2006 .

[4]  Bas Aarts,et al.  Investigating English around the world: The International Corpus of English , 1999 .

[5]  John C. Wells,et al.  English Intonation : An Introduction , 2006 .

[6]  Daniel Hirst,et al.  A PRAAT PLUGIN FOR MOMEL AND INTSINT WITH IMPROVED ALGORITHMS FOR MODELLING AND CODING INTONATION. , 2007 .

[7]  Alex Housen,et al.  A corpus-based study of the L2-acquisition of the English verb system , 2002 .

[8]  Guy Aston,et al.  The BNC Handbook: Exploring the British National Corpus with SARA , 1998 .

[9]  Sylviane Granger,et al.  Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching , 2002 .

[10]  Elisabeth Delais-Roussarie,et al.  La Prosodie des énoncés interrogatifs en français langue seconde (The prosody of questions in French as L2) [in French] , 2012, JEP/TALN/RECITAL.

[11]  R. Quirk,et al.  A Corpus of English Conversation , 1980 .

[12]  Sidney Greenbaum,et al.  Comparing English worldwide : the International Corpus of English , 1996 .

[13]  J. O'connor Intonation Of Colloquial English , 1961 .

[14]  Daniel Jones,et al.  English Pronouncing Dictionary , 1917 .

[15]  Daniel Hirst,et al.  The Aix-MARSEC project: an evolutionary database of spoken British English and automatic tools , 2004 .

[16]  Carmen Muñoz,et al.  Age and the rate of foreign language learning , 2006 .

[17]  Anne Tortel ANGLISH. Une base de données comparatives de l’anglais lu, répété et parlé en L1 & L2 , 2008 .

[18]  P. Delattre Les Dix Intonations de base du français. , 1966 .

[19]  Roxane Bertrand,et al.  Orthographic Transcription: which enrichment is required for phonetization? , 2012, LREC.

[20]  Ulrike Gut,et al.  A Prosodic Corpus of Non-Native Speech , 2002 .

[21]  Daniel Hirst,et al.  The analysis by synthesis of speech melody: from data to models. , 2011 .

[22]  Angelika Breiteneder,et al.  VOICE Recording - Methodological Challenges in the Compilation of a Corpus of Spoken ELF , 2006 .

[23]  Florence Myles,et al.  French learner language oral corpora , 2008 .

[24]  Michael Halliday,et al.  Intonation and Grammar in British English , 1967 .

[25]  Dafydd Gibbon,et al.  EUROM - a spoken language resource for the EU - the SAM projects , 1995, EUROSPEECH.

[26]  Elisabeth Delais-Roussarie,et al.  The prosody of interrogatives in French , 2007 .

[27]  D. Hirst,et al.  Rhythm metrics and the productions of L1/L2 , 2010 .

[28]  Hermann Moisl,et al.  “A Linguistic ‘Time Capsule’: the Newcastle Electronic Corpus of Tyneside English”. , 2007 .

[29]  Anne Tortel ÉVALUATION QUALITATIVE DE LA PROSODIE D'APPRENANTS FRANÇAIS: APPORT DE PARAMÉTRISATION PROSODIQUES , 2009 .

[30]  Sylviane Granger,et al.  The International Corpus of Learner English , 1993 .

[31]  Gabriel Bergounioux Etude Socio-Linguistique sur Orléans (1966-1970) , 1996 .

[32]  Paul Boersma,et al.  Praat, a system for doing phonetics by computer , 2002 .

[33]  Brigitte Bigi,et al.  SPPAS: a tool for the phonetic segmentation of speech , 2012, LREC.

[34]  Eugene Mollet Age and the Rate of Foreign Language Learning, Carmen Muñoz (Ed.), in: Multilingual Matters. Clevedon (2006), 296 pages , 2007 .

[35]  Daniel Hirst,et al.  Automatic modelling of fundamental frequency using a quadratic spline function , 1993 .

[36]  Anastassia Loukina,et al.  Rhythm measures and dimensions of durational variation in speech. , 2011, The Journal of the Acoustical Society of America.

[37]  Brigitte Bigi,et al.  A Multilingual Text Normalization Approach , 2011, LTC.

[38]  Anne Tortel ANGLISH : base de données comparatives L1 & L2 de l’anglais lu, répété et parlé , 2008 .

[39]  Fernand Carton,et al.  Introduction à la phonétique du français , 1974 .

[40]  Bas Aarts,et al.  Exploring Natural Language: Working with the British Component of the International Corpus of English , 2002 .

[41]  H. Kucera,et al.  Computational analysis of present-day American English , 1967 .

[42]  Céline Horgues Prosodie de l'accent français en anglais et perception par des auditeurs anglophones , 2010 .

[43]  J. Durand,et al.  La phonologie du français contemporain : usages, variétés et structures , 2001 .

[44]  Roxane Bertrand,et al.  Automatic Detection of Syllable Boundaries in Spontaneous Speech , 2010, LREC.