Analysis on Chinese quantitative stylistic features based on text mining

This article uses data mining to examine whether certain linguistic features, taking parts of speech (POS) as an example, can serve as quantitative stylistic features of Chinese. Put differently, it explores a method for determining Chinese quantitative stylistic features. Texts of six styles (news, science, official, art, TV conversation, and daily conversation) were selected to build the corpus for the study. Text vectors characterized by POS were analyzed with principal component analysis and clustered with agglomerative hierarchical clustering; the results of both indicate that POS can serve as a distinctive feature of texts. A support vector machine was then used to build a classification model on training data, and precision and recall were used to validate the classification results. Random forest was used to compute the importance of each POS, i.e. its contribution to classification, and text vectors characterized by the important POS were then clustered and classified in turn. The experiments show that POS can be taken as a Chinese quantitative stylistic feature, and that clustering and classification perform best when the 60 most important POS are used to characterize the texts.
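The pipeline described above can be illustrated with a minimal sketch in Python using scikit-learn. It is not the authors' implementation. The corpus here is synthetic: the tag-set size (`N_POS = 80`), the number of texts per style, the text length, the Ward linkage, the linear kernel, and the feature scaling are all assumptions made for illustration. Only the six styles and the choice of the 60 most important POS come from the abstract.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_STYLES = 6          # news, science, official, art, TV and daily conversation
N_POS = 80            # hypothetical size of the POS tag set
TEXTS_PER_STYLE = 40  # hypothetical corpus size per style
TOKENS_PER_TEXT = 2000  # hypothetical text length
TOP_K = 60            # number of most important POS kept, as in the abstract

# Synthetic stand-in for the corpus: each style gets its own POS distribution,
# and each text is a vector of relative POS frequencies sampled from it.
rng = np.random.default_rng(0)
profiles = rng.dirichlet(np.ones(N_POS), size=N_STYLES)
X = np.vstack([
    rng.multinomial(TOKENS_PER_TEXT, p, size=TEXTS_PER_STYLE) / TOKENS_PER_TEXT
    for p in profiles
])
y = np.repeat(np.arange(N_STYLES), TEXTS_PER_STYLE)

# Principal component analysis: project the POS vectors to 2 dimensions.
X_2d = PCA(n_components=2).fit_transform(X)

# Agglomerative hierarchical clustering into as many clusters as styles.
cluster_labels = AgglomerativeClustering(
    n_clusters=N_STYLES, linkage="ward"
).fit_predict(X)


def svm_precision_recall(features, labels):
    """Train an SVM on a stratified split; return macro precision and recall."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0
    )
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    predicted = model.fit(X_tr, y_tr).predict(X_te)
    precision = precision_score(y_te, predicted, average="macro", zero_division=0)
    recall = recall_score(y_te, predicted, average="macro", zero_division=0)
    return precision, recall


# Classification with all POS features.
precision_all, recall_all = svm_precision_recall(X, y)

# Random forest importance: rank POS tags by their contribution to classification.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_idx = np.argsort(forest.feature_importances_)[::-1][:TOP_K]

# Repeat the classification using only the TOP_K most important POS.
precision_top, recall_top = svm_precision_recall(X[:, top_idx], y)

print(f"all POS:    precision={precision_all:.3f} recall={recall_all:.3f}")
print(f"top {TOP_K} POS: precision={precision_top:.3f} recall={recall_top:.3f}")
```

With a real corpus, `X` would be replaced by the POS relative-frequency vectors of the tagged texts and `y` by their style labels. The rest of the sketch would stay the same.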
