Automatic Conversion of Dialectal Tamil Text to Standard Written Tamil Text Using FSTs
Rules, Analogy, and Social Factors Codetermine Past-tense Formation Patterns in English
Revisiting Word Neighborhoods for Speech Recognition

Preface

These proceedings contain the contributions presented at the MORPHFSM workshop held on June 27, 2014 in conjunction with ACL 2014 in Baltimore, Maryland, USA. The workshop was a joint meeting of two special interest groups of the ACL:

• SIGMORPHON, the special interest group in computational morphology, phonology and phonetics, and
• SIGFSM, the special interest group on finite-state methods.

It was the thirteenth meeting of SIGMORPHON and an off-year event for SIGFSM. The full-day workshop consisted of an invited presentation by Jason Eisner, contributed presentations, and a special panel session on open problems. The workshop covered a wide range of topics, from theoretical to applied morphology and finite-state technology in natural language processing.

This volume contains the 7 regular papers and 1 panel paper that were presented at the workshop. In total, 12 papers (10 regular and 2 panel papers) were submitted to a double-blind review process in which each paper was reviewed by 3 program committee members. The overall acceptance rate was 67%. The program committee was composed of internationally leading researchers and practitioners selected from academia, research labs, and companies.

The organizing committee would like to thank the program committee for their hard work and valuable feedback, the invited speaker Jason Eisner for his innovative and inspiring keynote, our panelists for their interesting discussion and expertise, the local organizers for their tireless efforts, the ACL administration for their support, and, last but not least, the authors for their contributions.

Abstract

Word neighborhoods have been suggested, but not thoroughly explored, as an explanatory variable for errors in automatic speech recognition (ASR). We revisit the definition of word neighborhoods, propose new measures using a fine-grained articulatory representation of word pronunciations, and consider new neighbor weighting functions. We analyze the significance of our measures as predictors of errors in an isolated-word ASR system and a continuous-word ASR system. We find that our measures are significantly better predictors of ASR errors than previously used neighborhood density measures.
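To make the notion of a word neighborhood concrete, the sketch below computes the conventional neighborhood density of a word, that is, the number of lexicon entries whose phoneme string is exactly one insertion, deletion, or substitution away. It is not the measure proposed in the abstract: the articulatory representation and the authors' weighting functions are not specified here. The toy lexicon, the function names, and the exponential-decay weighting in `weighted_neighborhood` are all assumptions made for illustration only.

```python
import math
from typing import Dict, Tuple

Pron = Tuple[str, ...]  # a pronunciation as a sequence of phoneme symbols


def edit_distance(a: Pron, b: Pron) -> int:
    """Levenshtein distance computed over phoneme symbols, not characters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def neighborhood_density(word: str, lexicon: Dict[str, Pron]) -> int:
    """Count lexicon entries exactly one phoneme edit away from `word`."""
    target = lexicon[word]
    return sum(1 for other, pron in lexicon.items()
               if other != word and edit_distance(target, pron) == 1)


def weighted_neighborhood(word: str, lexicon: Dict[str, Pron],
                          decay: float = 1.0) -> float:
    """Hypothetical weighted variant: every other word contributes
    exp(-decay * distance), so closer words count more.
    This weighting is an illustrative assumption, not the paper's."""
    target = lexicon[word]
    return sum(math.exp(-decay * edit_distance(target, pron))
               for other, pron in lexicon.items() if other != word)


# Toy lexicon with ARPAbet-like symbols (made up for this example).
LEXICON: Dict[str, Pron] = {
    "cat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "cut": ("k", "ah", "t"),
    "cast": ("k", "ae", "s", "t"),
    "at": ("ae", "t"),
    "dog": ("d", "ao", "g"),
}

if __name__ == "__main__":
    for w in ("cat", "dog"):
        print(w, neighborhood_density(w, LEXICON),
              round(weighted_neighborhood(w, LEXICON), 3))
```

In this toy lexicon, "cat" has four one-edit neighbors ("bat", "cut", "cast", "at") and "dog" has none. A hard threshold at distance 1 treats all neighbors as equally confusable, which is the kind of coarseness that a graded weighting function is meant to address.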
