Paper information - EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation

The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.
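The 8-bit-to-Unicode translation mentioned above can be illustrated with a table-driven transcoder. This is a minimal sketch and not the EMILLE project's actual tooling: the table `LEGACY_TO_UNICODE`, its byte values, and the function `to_unicode` are all hypothetical names and values chosen for illustration. Real legacy encodings for South Asian scripts usually also need reordering and multi-byte rules, which this sketch leaves out.

```python
# Hypothetical mapping from legacy 8-bit codes to Unicode code points.
# The byte values are illustrative assumptions, not a real font table.
LEGACY_TO_UNICODE = {
    0xA4: "\u0905",  # DEVANAGARI LETTER A
    0xB3: "\u0915",  # DEVANAGARI LETTER KA
    0xDA: "\u093E",  # DEVANAGARI VOWEL SIGN AA
}


def to_unicode(data: bytes, table=LEGACY_TO_UNICODE, unknown="\uFFFD") -> str:
    """Convert legacy 8-bit text to a Unicode string, one byte at a time."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Assume the lower half of the code page is plain ASCII.
            out.append(chr(byte))
        else:
            # Unmapped bytes become U+FFFD so that gaps in the table are visible.
            out.append(table.get(byte, unknown))
    return "".join(out)


if __name__ == "__main__":
    print(to_unicode(bytes([0xB3, 0xDA])))
```

Emitting U+FFFD for unmapped bytes, rather than dropping them silently, makes gaps in the mapping table easy to find when validating converted text.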
