Thematic Concentration as a Discriminating Feature of Text Types

Abstract: Human readers can intuitively grasp the gist of a text's thematic content through comprehensive reading, and this human-like generalization process can be placed on a more exact, quantitative basis. Focusing on three representative text types in Chinese and English drawn from two comparable corpora, LCMC (the Lancaster Corpus of Mandarin Chinese) and Frown (the Freiburg-Brown Corpus of American English), this study compares the thematic characteristics of these texts using PAM (Partitioning Around Medoids) and HA (Hierarchical Agglomerative) clustering on three quantitative indicators: TC (Thematic Concentration), STC (Secondary Thematic Concentration) and PTC (Proportional Thematic Concentration). The results show that (1) the eigenvectors representing the thematic characteristics of the three text types cluster into their corresponding categories in both Chinese and English, and (2) two factors contribute to the clustering results. One is that the TC, STC and PTC values of the three text types lie at different hierarchical levels; the other is that the three text types differ in their percentages of 'thematic words', especially nouns, in the pre-h-point and pre-2h-point domains. The characterization of the three text types as thematic-intensive (Official Document), thematic-balanced (News) and thematic-dispersive (Fiction) holds cross-linguistically in both Chinese and English.
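The abstract does not state how the indicators are computed. As an illustration only, the following minimal Python sketch assumes the h-point and TC definitions commonly given in the quantitative-linguistics literature: the h-point is the rank r at which the frequency f(r) equals r (linearly interpolated when no such rank exists), and TC sums 2(h - r')f(r') / (h(h - 1)f(1)) over the thematic words at ranks r' above the h-point. The function names, the toy frequency list and the choice of which ranks count as thematic are hypothetical and are not taken from the study.

```python
from typing import Iterable, Sequence


def h_point(freqs: Sequence[int]) -> float:
    """Return the h-point of a rank-frequency list sorted in descending order.

    Assumed definition: the rank r where f(r) == r; if no such rank exists,
    interpolate linearly between the two neighbouring ranks.
    """
    for idx, f in enumerate(freqs):
        r = idx + 1  # ranks are 1-based
        if f == r:
            return float(r)
        if f < r:
            if idx == 0:
                return float(r)
            f_i, r_i = freqs[idx - 1], r - 1
            # Intersection of the line through (r_i, f_i) and (r, f) with f = r.
            return (f_i * r - f * r_i) / (r - r_i + f_i - f)
    # Fallback (an assumption): the list never drops to f(r) <= r.
    return float(len(freqs))


def thematic_concentration(freqs: Sequence[int], thematic_ranks: Iterable[int]) -> float:
    """Sum the assumed TC weight over thematic words ranked above the h-point."""
    h = h_point(freqs)
    if h <= 1:
        return 0.0
    f1 = freqs[0]
    return sum(
        2 * (h - r) * freqs[r - 1] / (h * (h - 1) * f1)
        for r in thematic_ranks
        if r < h
    )


if __name__ == "__main__":
    toy_freqs = [10, 8, 6, 4, 3, 2, 1]  # hypothetical rank-frequency list
    toy_thematic = [2, 3]               # hypothetical ranks of thematic words
    print(h_point(toy_freqs))
    print(round(thematic_concentration(toy_freqs, toy_thematic), 4))
```

Under the same assumptions, a secondary indicator over the pre-2h-point domain could be sketched by replacing h with 2h in the weight, but the exact normalization used by the study is not given in the abstract.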
