Paper information - Analysis on Chinese quantitative stylistic features based on text mining

This article uses data mining to examine whether certain linguistic features, with parts of speech (POS) as the example, can serve as quantitative stylistic features of Chinese. Put differently, its purpose is to explore a method for determining Chinese quantitative stylistic features. The corpus consists of texts in six styles: news, science, official, art, TV conversation, and daily conversation. Text vectors characterized by POS were analyzed by principal component analysis and clustered by agglomerative hierarchical clustering. The results of both indicate that POS can be used as a distinctive feature of texts. A support vector machine was then used to build a classification model on training data, and precision and recall were used to validate the classification results. Random forest was used to compute the importance of each POS, i.e. its contribution to classification, and text vectors characterized by the important POS were then clustered and classified. The experiments show that POS can be taken as a Chinese quantitative stylistic feature, and that clustering and classification give better results when the 60 most important POS are used as the text features.
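As a rough illustration of two steps named in the abstract, the sketch below builds a POS relative-frequency vector for a text and computes precision and recall for one style label. It is not the authors' implementation. The `word/TAG` input format, the tag names, and the function names `pos_vector` and `precision_recall` are all assumptions made for this example. The PCA, clustering, SVM, and random forest steps are not shown.

```python
from collections import Counter


def pos_vector(tagged_text, tagset):
    """Return the relative frequency of each tag in `tagset`.

    Assumes (hypothetically) that the text is already segmented and
    tagged as whitespace-separated 'word/TAG' tokens.
    """
    tags = [tok.rsplit("/", 1)[1] for tok in tagged_text.split() if "/" in tok]
    counts = Counter(tags)
    total = len(tags)
    # Divide by the number of tagged tokens so texts of different
    # lengths become comparable.
    return [counts[t] / total if total else 0.0 for t in tagset]


def precision_recall(gold, predicted, label):
    """Precision and recall of `label`, given gold and predicted styles."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative input only: the tags n (noun), v (verb), r (pronoun)
# are an assumed tagset, not the one used in the paper.
vector = pos_vector("我/r 喜欢/v 读/v 书/n", ["n", "v", "r"])

gold = ["news", "news", "art", "art"]
predicted = ["news", "art", "art", "art"]
art_precision, art_recall = precision_recall(gold, predicted, "art")
```

Each text becomes one such vector, with one dimension per POS tag. Restricting `tagset` to a subset of tags corresponds to characterizing texts by only the most important POS.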

Minghu Jiang | Renkui Hou
