The World Wide Web is a vast repository of information, but its sheer volume makes it difficult to identify useful documents. We argue that document genre is an important factor in retrieving useful documents, and we focus on a novel genre dimension: subjectivity. We investigate three approaches to automatically classifying documents by genre: traditional bag-of-words techniques, part-of-speech statistics, and hand-crafted shallow linguistic features. We are particularly interested in domain transfer, that is, how well the learned classifiers generalize from the training corpus to a new document corpus. Our experiments demonstrate that the part-of-speech approach outperforms traditional bag-of-words techniques, particularly in the domain-transfer conditions.
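To make the contrast between the first two representations concrete, the following toy sketch builds a bag-of-words vector and a part-of-speech frequency vector for two short documents from different domains. It is an illustration only, not the feature set or classifier used in the experiments: the function names are hypothetical, the documents are invented, and the Penn Treebank-style tags are assigned by hand rather than produced by a tagger.

```python
from collections import Counter

def bag_of_words(tagged_doc):
    """Relative frequency of each lowercased word (illustrative helper)."""
    counts = Counter(word.lower() for word, _ in tagged_doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def pos_statistics(tagged_doc):
    """Relative frequency of each part-of-speech tag (illustrative helper)."""
    counts = Counter(tag for _, tag in tagged_doc)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def shared_features(a, b):
    """Feature names that occur in both representations."""
    return set(a) & set(b)

# Two invented, hand-tagged subjective sentences from different domains.
movie_review = [("This", "DT"), ("film", "NN"), ("is", "VBZ"),
                ("truly", "RB"), ("wonderful", "JJ")]
restaurant_review = [("That", "DT"), ("soup", "NN"), ("was", "VBD"),
                     ("really", "RB"), ("awful", "JJ")]

# The two documents share no vocabulary, so their word features do not overlap.
word_overlap = shared_features(bag_of_words(movie_review),
                               bag_of_words(restaurant_review))

# Most of their part-of-speech features do overlap.
tag_overlap = shared_features(pos_statistics(movie_review),
                              pos_statistics(restaurant_review))

print(sorted(word_overlap))
print(sorted(tag_overlap))
```

In this toy case the word features of the two documents are disjoint while most tag features coincide, which gives one intuition for why a tag-based representation might carry over to a new corpus more readily than a vocabulary-based one.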