Challenges and Opportunities in Sociolinguistic Data and Metadata Sharing

Advances in computing technology, coupled with the recent focus on big data in the social sciences, have provided the motivation and some of the infrastructure necessary for sociolinguists to share data among themselves and with researchers in related fields such as human language technologies (HLT). Collaboration among sociolinguists offers the promise of extending current knowledge beyond the community studies that have dominated the field for the past 50 years and of focusing more on regional and national patterns of variation and change and on what they indicate about linguistic theory. Collaboration with HLT developers, while relatively new and still uncommon, has led to advances both in sociolinguistic methodology and in technologies suited to sociolinguistic research. Before the field can make full use of these advances, however, sociolinguists must confront a number of challenges. Studies that were developed with the intent of describing a single speech community presumably need not ensure, and in many cases have not ensured, consistency with prior work. Given this practice, attempts to compare phenomena across studies must address mismatches at the levels of data elicitation and selection, coding practice, and the definition of underlying concepts. Adding to the confusion wrought by methodological differences, speech communities differ in ways that the field worker cannot always predict, so that different and sometimes unique linguistic and non-linguistic features are found to vary with linguistic structure. This paper underscores the motivation for data sharing by identifying some limitations of comparisons based only on published papers and by reviewing advances fueled by data sharing among linguists and between linguists and technology developers. It also documents some of the challenges that hinder data sharing by reviewing work that has built upon available corpora. Finally, it summarizes efforts outside of sociolinguistics that have proposed frameworks for sharing and comparing metadata and categories, setting the stage for the papers that follow in these special issues.
