Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development

This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.
