The automatic identification of lexical variation between language varieties

Languages are not uniform. Speakers of different language varieties use certain words differently – more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for the investigation of such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of two different language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. Second, we address the problem of automatically identifying words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models are able to identify more lectal markers than traditional keyword methods. Distributional models also have a bias towards a different type of variation. In summary, our results demonstrate how distributional semantics can help research in variational linguistics, with possible future applications in lexicography or terminology extraction.
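As a rough illustration of the first research question, the sketch below retrieves the closest word in one variety for a word from the other by comparing context-word counts with cosine similarity. It is a minimal sketch, not the models used in the study: the function names (`context_vector`, `cosine`, `nearest`), the window size, the raw-count weighting, and the toy sentences are all assumptions made for illustration. It also assumes that context words are shared across the two varieties, so the vectors can be compared directly.

```python
from collections import Counter
from math import sqrt

def context_vector(sentences, target, window=2):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(source_corpus, target_corpus, word, candidates, window=2):
    """Rank candidate words from the target variety by similarity to `word`."""
    src = context_vector(source_corpus, word, window)
    scored = [
        (cosine(src, context_vector(target_corpus, c, window)), c)
        for c in candidates
    ]
    return sorted(scored, reverse=True)

# Toy data, invented for illustration only (not taken from the study's corpora).
belgian = [
    "de camion rijdt op de snelweg",
    "de chauffeur laadt de camion",
]
netherlandic = [
    "de vrachtwagen rijdt op de snelweg",
    "de chauffeur laadt de vrachtwagen",
    "de fiets staat in de schuur",
]

ranked = nearest(belgian, netherlandic, "camion", ["vrachtwagen", "fiets"])
for score, candidate in ranked:
    print(f"{candidate}\t{score:.3f}")
```

On realistic corpora one would typically replace the raw counts with an association weight and filter out function words; those choices are left out here to keep the sketch short.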
