Building a document genre corpus: a profile of the KRYS I corpus

This paper describes the KRYS I corpus, which consists of documents classified into 70 genre classes. It was constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously, very little work has been done on building corpora of texts classified using a non-topical genre palette. This is partly because genre, as a concept rooted in philosophy, rhetoric and literature, is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised, and no genre classification schema has yet been established as having practical value for this purpose. By presenting our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, and to suggest a way forward for creating a better corpus for testing automated genre classification and for applying such classification to other domains.

[1]  Carol Van Ess-Dykema, et al.  The Form is the Substance: Classification of Genres in Text, 2001, HTLKM@ACL.

[2]  Douglas Biber, et al.  Representativeness in corpus design, 1993.

[3]  Carolyn R. Miller.  Genre as social action, 1984.

[4]  Andrew McCallum, et al.  Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora, 2005.

[5]  Yunhyong Kim, et al.  Variation of word frequencies across genre classification tasks, 2007.

[6]  Kevin Crowston, et al.  Genres of Digital Documents: Introduction to the Special Issue, 2005.

[7]  Benno Stein, et al.  Distinguishing Topic from Genre, 2004.

[8]  George R. Thoma.  Automating the production of bibliographic records for MEDLINE, 2001.

[9]  W. Orlikowski, et al.  Genre Systems: Structuring Interaction through Communicative Norms, 2002.

[10]  Seamus Ross, et al.  Preservation research and sustainable digital libraries, 2005, International Journal on Digital Libraries.

[11]  Marina Santini, et al.  Automatic identification of genre in Web pages, 2011.

[12]  Edward A. Fox, et al.  Automatic document metadata extraction using support vector machines, 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings.

[13]  Chris Bowerman, et al.  PERC: A Personal Email Classifier, 2006, ECIR.

[14]  Paul H. Garthwaite, et al.  Frequent Term Distribution Measures for Dataset Profiling, 2004, LREC.

[15]  Jihoon Yang, et al.  Knowledge-based metadata extraction from PostScript files, 2000, DL '00.

[16]  Yunhyong Kim, et al.  Examining Variations of Prominent Features in Genre Classification, 2008, Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008).