Using Resource-Rich Languages to Improve Morphological Analysis of Under-Resourced Languages

The worldwide proliferation of digital communications has created the need for language and speech processing systems for under-resourced languages. Developing such systems is challenging if only small data sets are available, and the problem is exacerbated for languages with highly productive morphology. However, many under-resourced languages are spoken in multilingual environments together with at least one resource-rich language and thus have numerous borrowings from resource-rich languages. Based on this insight, we argue that readily available resources from resource-rich languages can be used to bootstrap the morphological analyses of under-resourced languages with complex and productive morphological systems. In a case study of two such languages, Tagalog and Zulu, we show that an easily obtainable English wordlist can be deployed to seed a morphological analysis algorithm from a small training set of conversational transcripts. Our method achieves a precision of 100% and identifies 28 and 66 of the most productive affixes in Tagalog and Zulu, respectively.
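
The seeding idea can be illustrated with a minimal sketch. It is not the algorithm evaluated in the paper, whose details are not given in this abstract. It assumes that a borrowed English stem appears unchanged inside a transcript token and that whatever surrounds the stem is a candidate affix. The function name `candidate_affixes`, the thresholds `min_stem_len` and `min_count`, and the toy tokens are all hypothetical.

```python
from collections import Counter


def candidate_affixes(tokens, english_words, min_stem_len=4, min_count=2):
    """Collect strings left over when a known English word occurs inside a token.

    Illustrative assumption: borrowed stems keep their English spelling, so the
    material before and after the stem is a candidate prefix or suffix.
    """
    prefixes, suffixes = Counter(), Counter()
    # Ignore very short English words; they match by chance too often.
    stems = {w.lower() for w in english_words if len(w) >= min_stem_len}
    # Count over word types, not token occurrences.
    for token in {t.lower() for t in tokens}:
        for stem in stems:
            start = token.find(stem)
            if start == -1 or token == stem:
                continue  # stem absent, or the token is the bare English word
            prefix = token[:start]
            suffix = token[start + len(stem):]
            if prefix:
                prefixes[prefix] += 1
            if suffix:
                suffixes[suffix] += 1

    def frequent(counter):
        # Keep only affixes attested with at least min_count distinct tokens.
        return {affix: n for affix, n in counter.items() if n >= min_count}

    return frequent(prefixes), frequent(suffixes)


# Toy input (hypothetical, not taken from the paper's transcripts).
toy_tokens = ["nagtext", "magtext", "nagdrive", "magdrive", "text", "bahay"]
toy_wordlist = ["text", "drive", "house"]
prefix_counts, suffix_counts = candidate_affixes(toy_tokens, toy_wordlist)
```

On this toy input, `nag` and `mag` each attach to two different English stems, so both pass the `min_count` threshold as candidate prefixes. Counting word types and requiring several distinct stems per affix is one simple way to favor precision over recall; the paper's actual selection criteria may differ.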
