A probabilistic approach to document classification

In this paper, we propose and evaluate a probabilistic approach to document classification. We consider the problem of automatically assigning a new article to a Usenet newsgroup. To model a newsgroup, we build a probabilistic language model that is assumed to generate the articles of that newsgroup. When a new article is presented, we use a Maximum A Posteriori rule to decide whether or not the article was generated by this newsgroup's model. We evaluate this approach and compare it to a classification based on keywords. In our experiments, the probabilistic approach gives better recall and precision. The paper is structured as follows. We first present the problem of document classification in general terms. We then describe our application to newsgroup classification and present the data that we use. Next, we present first results for a classification based on keyword selection. Finally, we describe the probabilistic formulation of the problem, apply this approach to the same data, and compare the results.
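
To make the decision rule concrete, the following is a minimal sketch of a Maximum A Posteriori test for a single newsgroup. It is an illustration under stated assumptions, not the model used in this paper: it assumes a unigram language model with add-one smoothing, a second "background" model for articles outside the newsgroup, and a known prior probability. The names train_unigram, log_likelihood, map_accepts and prior_in are hypothetical, and the toy articles are invented.

```python
import math
from collections import Counter


def train_unigram(docs, vocab):
    """Estimate add-one-smoothed unigram log-probabilities from tokenized docs.

    Assumption: a unigram model; the paper's language model may differ.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def log_likelihood(doc, model):
    # Tokens outside the vocabulary are simply ignored in this sketch.
    return sum(model[tok] for tok in doc if tok in model)


def map_accepts(doc, in_model, out_model, prior_in):
    """MAP rule: accept the article if log P(group) + log P(doc | group)
    exceeds the same score computed for the 'not in group' hypothesis."""
    score_in = math.log(prior_in) + log_likelihood(doc, in_model)
    score_out = math.log(1.0 - prior_in) + log_likelihood(doc, out_model)
    return score_in > score_out


# Toy tokenized articles, invented for illustration only.
in_docs = [["linux", "kernel", "patch"], ["kernel", "driver", "bug"]]
out_docs = [["recipe", "pasta", "sauce"], ["football", "match", "score"]]
vocab = {tok for doc in in_docs + out_docs for tok in doc}

in_model = train_unigram(in_docs, vocab)
out_model = train_unigram(out_docs, vocab)

accept_kernel = map_accepts(["kernel", "bug"], in_model, out_model, prior_in=0.5)
accept_pasta = map_accepts(["pasta", "score"], in_model, out_model, prior_in=0.5)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the prior lets the rule account for how frequent the newsgroup is relative to all other traffic.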