Scalability of Text Classification

We explore scalability issues in the text classification problem, in which (multi)labeled training documents are used to build classifiers that assign documents to classes, permitting a document to belong to multiple classes. We introduce a new class of classification problems, called ‘scalable’, that models many problems from the area of Web mining. The property of scalability is defined as the ability of a classifier to adjust classification results on a ‘per-user’ basis. Furthermore, we investigate different ways to interpret personalization of classification results by analyzing well-known text datasets and exploring existing classifiers. We present solutions for the scalable classification problem based on standard classification techniques, as well as an algorithm that relies on semantic analysis through the decomposition of documents into their sentences. Experimental results concerning the scalability property and the performance of these algorithms are provided using the 20newsgroup dataset and a dataset consisting of web news.
