Textual Characteristics of Different-sized Corpora

Recently, textual characteristics, i.e. certain language statistics, have been proposed to compare corpora originating from different genres and domains, to give guidance in language engineering processes and to estimate the transferability of natural language processing algorithms from one corpus to another. However, it has so far been unclear how these textual characteristics behave for different-sized corpora. We monitor the behavior of 7 textual characteristics across 4 genres (news articles, Wikipedia articles, general web text and fora posts) and 10 corpus sizes, ranging from 100 to 3,000,000 sentences. We show that certain textual characteristics are almost constant across corpus sizes and thus might be used to reliably compare different-sized corpora, while others are highly dependent on corpus size and thus may only be used to compare similar- or same-sized corpora. Moreover, we find that although textual characteristics vary from genre to genre, their behavior for increasing corpus size is quite similar.
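
To illustrate what size dependence of a textual characteristic can look like, the following is a minimal sketch, not the procedure used in this study. It assumes two illustrative measures, average sentence length and type-token ratio, which are not necessarily among the 7 characteristics examined here; the function names, the whitespace tokenization and the nested random sampling scheme are likewise assumptions made for the example.

```python
import random


def avg_sentence_length(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)


def type_token_ratio(sentences):
    """Number of distinct lowercased tokens divided by the total token count."""
    tokens = [t.lower() for s in sentences for t in s.split()]
    return len(set(tokens)) / len(tokens)


def characteristics_by_size(sentences, sizes, seed=0):
    """Compute both measures on nested random samples of increasing size.

    Hypothetical helper: sizes larger than the corpus are skipped, and the
    corpus is assumed to contain at least one non-empty sentence.
    """
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    results = {}
    for n in sorted(sizes):
        if n > len(shuffled):
            break
        sample = shuffled[:n]  # each sample contains the previous, smaller one
        results[n] = {
            "avg_sentence_length": avg_sentence_length(sample),
            "type_token_ratio": type_token_ratio(sample),
        }
    return results
```

Calling `characteristics_by_size(corpus, [100, 1000, 10000])` on a list of sentence strings returns one entry per feasible size. In such a sketch, a per-sentence average like sentence length would be expected to stay roughly stable as the sample grows, whereas the type-token ratio typically falls, because the number of tokens grows faster than the number of distinct word forms.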
