Introduction to Information Retrieval: Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval , where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips . One way of doing this is to issue the query multicore and computer and chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries . A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time. If your standing query is just multicore and computer and chip, you will tend to miss many relevant new articles which use other terms such as multicore processors . To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore or multi-core) and (chip or processor or microprocessor). To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes , we seek to determine which class(es) a given object belongs to.