The Web Library of Babel: evaluating genre collections

We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several genre-annotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS). We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of their ability to generalise over both lexical and syntactic phenomena related to genres. However, we also show that these impressive results might not be transferable to the wider web, for three reasons: the lack of comparability between annotation labels (many webpages cannot be described in terms of the genre labels in individual collections), the lack of representativeness of existing collections (many genres are represented by webpages coming from a small number of sources), and problems in the reliability of genre annotation (many pages from the web are difficult to interpret in terms of the labels available). This suggests that more research is needed to understand genres on the Web.
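To illustrate why character n-grams can pick up lexical and syntactic cues at the same time, the following is a minimal sketch of character n-gram counting. It is not the feature extraction used in the experiments: the function name `char_ngrams`, the n-gram length of 4, the lowercasing and whitespace normalisation, and the sample page text are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in a text.

    Spaces and punctuation are kept, so an n-gram can span a word
    boundary or include a full stop. The length n=4 is an assumed
    value, not one taken from the paper.
    """
    # Lowercase and collapse runs of whitespace to single spaces.
    normalised = " ".join(text.lower().split())
    # Slide a window of width n over the normalised string.
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


# Hypothetical page text, for illustration only.
page = "Click here to subscribe. Click here to read more."
features = char_ngrams(page, n=4)
```

In this sketch, an n-gram such as `clic` reflects word choice, while `ere ` and `e. c` cross word and sentence boundaries and so carry some information about function words and punctuation. Counts like these could then be passed to a linear classifier as a feature vector.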
