Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms by Thorsten Joachims

Those trying to make sense of textual content and semantics within the wild, wild world of information retrieval, categorization, and filtering often have to deal with an overwhelming sea of problems. The really strange part of the story is that most of them (myself included) still believe that developing a linguistically principled approach to text categorization is an interesting research problem. This belief will also surface in the discussion of the book that is the focus of this review. Learning to Classify Text Using Support Vector Machines by Thorsten Joachims proposes a theory for automatic learning of text categorization models that has been repeatedly shown to be very successful. At the same time, the proposed approach is based on a rather rough linguistic generalization of (what apparently is) a language-dependent task: topic text classification (TC). The result is twofold: on the one hand, a learning theory, based on statistical learnability principles and results, that avoids the limitations of the strong empiricism typical of most text classification research; on the other hand, the application of a naive linguistic model, the bag-of-words representation, to linguistic objects (i.e., the documents) that still achieves impressive accuracy.