SINICA CORPUS: Design Methodology for Balanced Corpora

The Academia Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 2.0) is open to the research community through the WWW (http://www.sinica.edu.tw/ftms-bin/kiwi.sh). The corpus currently contains 3.5 million words, and the immediate expansion target is five million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are organized in a hierarchy. Subcorpora can be defined by a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society, and each segmented word is tagged with its part of speech. Linguistic patterns and language structures can be extracted from the tagged corpus with a corpus inspection program that supports KWIC (keyword-in-context) searching, filtering, statistics, printing, and collocation.
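
To make the ideas of attribute-defined subcorpora and KWIC searching concrete, here is a minimal Python sketch. It is not the Sinica Corpus inspection program: the `Text` class, the `subcorpus` and `kwic` functions, the attribute values, and the sample sentences and tags are all hypothetical, chosen only for illustration. It also treats feature values as flat strings, whereas the corpus assigns them in a hierarchy.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """One corpus text: classification attributes plus segmented, tagged words."""
    attrs: dict   # hypothetical values, e.g. {"genre": "news", "mode": "written"}
    tokens: list  # list of (word, part_of_speech) pairs


def subcorpus(texts, **wanted):
    """Keep only the texts whose attributes equal every requested value."""
    return [
        t for t in texts
        if all(t.attrs.get(key) == value for key, value in wanted.items())
    ]


def kwic(texts, keyword, window=2):
    """Return (left context, (keyword, tag), right context) for each match."""
    hits = []
    for t in texts:
        for i, (word, tag) in enumerate(t.tokens):
            if word == keyword:
                left = [w for w, _ in t.tokens[max(0, i - window):i]]
                right = [w for w, _ in t.tokens[i + 1:i + 1 + window]]
                hits.append((left, (word, tag), right))
    return hits


# Illustrative data only; the tags are placeholders, not verified corpus output.
SAMPLE = [
    Text({"genre": "news", "mode": "written"},
         [("我", "Nh"), ("讀", "VC"), ("書", "Na")]),
    Text({"genre": "speech", "mode": "spoken"},
         [("書", "Na"), ("很", "Dfa"), ("好", "VH")]),
]

written = subcorpus(SAMPLE, mode="written")
for left, (word, tag), right in kwic(written, "書"):
    print(" ".join(left), f"[{word}/{tag}]", " ".join(right))
```

Filtering first and searching second mirrors the workflow the abstract describes: a researcher selects a subcorpus by attributes, then inspects word usage within it.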
