Aspect Marking in English and Chinese: Using the Lancaster Corpus of Mandarin Chinese for Contrastive Language Study

This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. We first discuss the major decisions we made when building the corpus, which concern sampling, text collection, mark-up, and annotation. We then use the corpus to study aspect marking in Chinese and in British and American English. The study shows that although Chinese and English are typologically different, aspect markers in the two languages show a strikingly similar distribution pattern, especially across the two broad categories of narrative and expository texts. The study also reveals some important differences in the distribution of aspect markers between Chinese and English, and between British and American English, across fifteen text categories, and provides an account of these differences.
