Automatic detection of English inclusions in mixed-lingual text with an application to parsing

The influence of English continues to grow to the extent that its expressions have begun to permeate the original forms of other languages. It has become more acceptable, and in some cases fashionable, for people to combine English phrases with their native tongue. This language-mixing phenomenon typically occurs first in conversation and subsequently in writing. In fact, there is evidence to suggest that currently at least one third of the advertising slogans used in Germany contain English words. The expansion of the Internet, coupled with an increased availability of electronic documents in various languages, has resulted in greater attention being paid to multilingual and language-independent applications. However, the automatic identification of foreign expressions, be they words or named entities, is beyond the capability of existing language identification techniques. This failure has inspired a recent growth in the development of new techniques capable of processing mixed-lingual text. This thesis presents an annotation-free classifier designed to identify English inclusions in other languages. The classifier consists of four sequential modules: pre-processing, lexical lookup, search engine classification and post-processing. These modules collectively identify English inclusions and are robust enough to work across different languages, as is demonstrated with German and French. The classifier's major advantage, however, is that it is annotation-free: it does not need any training, a step that normally requires an annotated corpus of examples. The English inclusion classifier presented in this thesis is the first of its type to be evaluated using real-world data. It has been shown to perform well on unseen data across different languages and domains. Comparisons are drawn between this system and the two leading alternative classification techniques.
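
As an illustration only, the four-module architecture described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the classifier evaluated in the thesis: the toy lexicons, the `fake_hit_count` stub standing in for search engine queries, the "EN"/"DE" labels, the fallback to the base language on ties, and the context rule used in post-processing are all hypothetical choices made for this example.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical toy lexicons; real lexical resources would be far larger.
GERMAN_LEXICON = {"der", "das", "ist", "sehr", "wichtig"}
ENGLISH_LEXICON = {"business", "meeting", "is"}

# Hypothetical hit counts standing in for real search engine queries.
FAKE_HITS = {("download", "en"): 900, ("download", "de"): 300}


def fake_hit_count(token: str, lang: str) -> int:
    """Stub: number of pages in `lang` containing `token` (assumed data)."""
    return FAKE_HITS.get((token.lower(), lang), 0)


def preprocess(text: str) -> List[str]:
    """Module 1: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lexical_lookup(token: str) -> Optional[str]:
    """Module 2: label a token found in exactly one lexicon, else None."""
    word = token.lower()
    in_en = word in ENGLISH_LEXICON
    in_de = word in GERMAN_LEXICON
    if in_en and not in_de:
        return "EN"
    if in_de and not in_en:
        return "DE"
    return None  # unknown or ambiguous: defer to the next module


def search_engine_classify(token: str, hit_count: Callable[[str, str], int]) -> str:
    """Module 3: pick the language with more hits; ties go to the base language."""
    return "EN" if hit_count(token, "en") > hit_count(token, "de") else "DE"


def postprocess(labels: List[str]) -> List[str]:
    """Module 4 (assumed rule): a base-language token flanked by two
    English tokens is relabelled as English."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i] == "DE" and labels[i - 1] == "EN" and labels[i + 1] == "EN":
            out[i] = "EN"
    return out


def classify(text: str,
             hit_count: Callable[[str, str], int] = fake_hit_count
             ) -> List[Tuple[str, str]]:
    """Run the four modules in sequence and return (token, label) pairs."""
    tokens = preprocess(text)
    labels = []
    for token in tokens:
        label = lexical_lookup(token)
        if label is None:
            label = search_engine_classify(token, hit_count)
        labels.append(label)
    return list(zip(tokens, postprocess(labels)))


if __name__ == "__main__":
    for token, label in classify("Der Download ist sehr wichtig"):
        print(token, label)
```

No step in the sketch is trained on annotated examples, which is the property the abstract calls annotation-free. The lookup module handles tokens that appear in exactly one lexicon, and only the remainder is passed on to the hit-count comparison.
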
This system compares favourably with the recently developed alternative technique of combined dictionary-based and n-gram-based classification and is shown to have significant advantages over a trained machine learner. This thesis demonstrates why English inclusion classification is beneficial through a series of real-world examples from different fields. It quantifies in detail the difficulty that existing parsers have in dealing with English expressions occurring in foreign language text. This is underlined by a series of experiments using two German parsers, one based on a treebank-induced grammar and one on a hand-crafted grammar. It will be shown that interfacing
