Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

Language varieties, and dialects in particular, are a primary means of expressing a person's social affiliation and identity. Computer systems that adapt to the user by displaying a familiar socio-cultural identity are therefore expected to raise acceptance dramatically within certain contexts and target groups. Although the currently prevailing statistical paradigm has made major achievements possible in many areas of natural language processing, the available methods are generally applicable only to major languages and standard varieties, to the exclusion of dialects or varieties that differ substantially from the standard. There are considerable initiatives to develop language resources for minor languages, and there are reliable methods for handling accents of a given language in applications such as speech synthesis or recognition. The situation for dialects, however, still calls for novel approaches, methods and techniques, both to overcome or circumvent the problem of data scarcity and to strengthen the standing of language varieties and dialects in natural language processing technologies and in the interaction technologies that build upon them.
We thought that such a workshop would be a fruitful enterprise because we are convinced that only joint efforts of researchers with expertise in various disciplines can bring about progress in this field. In our call we therefore aimed to bring together colleagues who work on topics ranging from machine learning algorithms and active learning, through machine translation between language varieties or dialects and speech synthesis and recognition, to issues of orthography, annotation and linguistic modelling. The 2011 Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties (DIALECTS 2011) is the first workshop to be held on this interdisciplinary topic.
The workshop received seventeen submissions, out of which six were accepted as oral presentations (long papers) and three as posters (short papers). These papers represent interesting work from almost all the scientific fields that were mentioned in the call as being necessary to contribute to the common goal.
