Entropy in different text types

The present study examines how the distinctive linguistic profiles of different text types are reflected in their entropy characteristics. Drawing on samples from the Lancaster Corpus of Mandarin Chinese and the Freiburg–Brown Corpus of American English, it investigates entropy along two dimensions: the relative entropy of word-forms and their part-of-speech (POS) tags at different sentential positions, and the entropy of aspect markers. The study yields the following results. First, Chinese and English show strikingly similar distribution patterns in the relative entropy of word-forms and POS-forms at different sentential positions. In descending order, the relative entropy of word-forms yields: news > essays > official > academic > fiction; that of POS-forms yields: fiction > essays > news > academic > official. The relative entropy of POS-forms may be the more reliable indicator of syntactic differences, as it helps distinguish the dichotomous ‘narrative vs. expository’ text types in both Chinese and English. Second, there is a cross-linguistic difference in the entropy of aspect markers: Chinese displays higher relative entropy than English, indicating that aspect marking, in terms of variation, is more prominent in Chinese grammar than in English. The ‘narrative vs. expository’ distinction is also identified by the entropy of aspect markers in both Chinese and English, though more clearly in Chinese.
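For reference, a minimal sketch of the entropy measures presumably involved, assuming that ‘relative entropy’ here denotes Shannon entropy normalized by its maximum value (a common convention in quantitative linguistics) rather than Kullback–Leibler divergence:

H = -\sum_{i=1}^{N} p_i \log_2 p_i, \qquad H_{\mathrm{rel}} = \frac{H}{\log_2 N}

where p_i is the relative frequency of the i-th word-form (or POS tag) at a given sentential position and N is the number of distinct types observed there; under this convention H_rel ranges from 0 (a single type dominates) to 1 (all types equally likely).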
