Asian language processing: current state-of-the-art

Asian language processing presents formidable challenges to achieving multilingualism and multiculturalism in our society. One of the first and most obvious challenges is the multitude and diversity of languages: more than 2,000 languages are listed as languages in Asia by Ethnologue (Gordon, 2005), representing four major language families: Austronesian, Trans-New Guinea, Indo-European, and Sino-Tibetan. The challenge is made more formidable by the fact that, as a whole, Asian languages range from the language with the most speakers in the world (Mandarin Chinese, close to 900 million native speakers) to the more than 70 nearly extinct languages (e.g., Pazeh in Taiwan, one speaker). As a result, there are vast differences in the level of language processing capability and the number of sharable resources available for individual languages. Major Asian languages such as Mandarin Chinese, Hindi, Japanese, Korean, and Thai have benefited from several years of intense language processing research, and fast-developing languages (e.g., Filipino, Urdu, and Vietnamese) are gaining ground. However, for many near-extinct languages, research and resources are scarce, and computerization represents the last resort for preservation after extinction. A comprehensive overview of the current state of Asian language processing must necessarily address the range of issues that arise due to the diversity of Asian languages and must reflect the vastly different state of the art for specific languages. Therefore, we have divided the special issues on Asian language technology into two parts. The first is a double issue entitled Asian Language Processing: State of the Art Resources and Processing, which focuses on state-of-the-art research issues given the diversity of Asian languages. Although the majority of papers in this double issue deal with
