Most research on automated text categorization has focused on determining the topic of a given text. While topic is generally the main characteristic of an information need, other characteristics are also useful for information retrieval. In this paper we consider the problem of text categorization according to style. For example, in searching the web, we may wish to automatically determine whether a given page is promotional or informative, whether it was written by a native English speaker, and so on. Learning to determine the style of a document is dual to learning to determine its topic, in that the document features that capture style are precisely those that are independent of topic. We define the features of a document to be the frequencies of each of a set of function words and part-of-speech triples. We then use machine learning techniques to classify documents. We test our methods on four collections of downloaded newspaper and magazine articles.
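The feature definition above can be illustrated with a minimal sketch. It assumes the text has already been tokenized and tagged by some part-of-speech tagger, and it uses relative frequencies. The function-word list, the tag names, the chosen triples, and all function names are hypothetical placeholders, not the sets or code used in the paper.

```python
from collections import Counter

# Hypothetical function-word list for illustration only; the paper's set is not given here.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]


def function_word_frequencies(tokens, function_words=FUNCTION_WORDS):
    """Relative frequency of each function word among all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in function_words]


def pos_triple_frequencies(tags, triples):
    """Relative frequency of each selected triple among all consecutive tag triples."""
    observed = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(observed.values()) or 1
    return [observed[t] / total for t in triples]


def style_features(tagged_tokens, triples, function_words=FUNCTION_WORDS):
    """Concatenate function-word and part-of-speech-triple frequencies into one vector."""
    tokens = [word for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    return (function_word_frequencies(tokens, function_words)
            + pos_triple_frequencies(tags, triples))


# Toy input: (word, tag) pairs as a tagger might produce them (tags are assumed).
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
# Hypothetical selection of triples to track.
triples = [("DT", "NN", "VBD"), ("IN", "DT", "NN")]

features = style_features(tagged, triples)
```

The resulting vector contains no content words, so it depends on how a document is written rather than on what it is about. Such vectors could then be passed to any standard learning algorithm for classification.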