Distributed Text Classification With an Ensemble Kernel-Based Learning Approach

Constructing a single text classifier that excels in every application is an unrealistic goal. As a result, ensemble systems are becoming an important resource, since they permit the use of simpler classifiers and the integration of different knowledge in the learning process. However, many text-classification ensemble approaches carry an extremely high computational burden, which limits their use in real-world settings. Moreover, state-of-the-art kernel-based classifiers, such as support vector machines and relevance vector machines, demand substantial computational resources when applied to large databases. To tackle these challenges, we propose a new systematic distributed ensemble framework, built on a generic deployment strategy for a distributed cluster environment. We combine task and data decomposition of the text-classification system, using partitioning, communication, agglomeration, and mapping to define and optimize a graph of dependent tasks. Additionally, the framework includes an ensemble system that exploits the diverse error patterns of its member classifiers and gains from the synergies between them. The ensemble data-partitioning strategy is shown to improve on the performance of baseline state-of-the-art kernel-based machines. The experimental results show that the proposed framework outperforms standard methods in both speed and classification performance.
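The idea of training ensemble members on separate data partitions and combining their outputs can be illustrated with a minimal sketch. It is not the framework described above: the base learner is a simple word-count centroid classifier standing in for a kernel machine, threads stand in for cluster nodes, majority voting is an assumed combination rule, and every function name is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, k):
    """Data decomposition: split items into k round-robin partitions."""
    return [items[i::k] for i in range(k)]


def train_centroid(docs):
    """Stand-in base learner (NOT an SVM or RVM): per-class word counts."""
    centroids = {}
    for text, label in docs:
        centroids.setdefault(label, Counter()).update(text.lower().split())
    return centroids


def predict_one(centroids, text):
    """Pick the class whose word counts overlap most with the document."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (alphabetical class order).
    return max(sorted(centroids), key=lambda c: sum(centroids[c][w] for w in words))


def train_ensemble(docs, k):
    """Train one base model per partition; threads stand in for cluster nodes."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(train_centroid, partition(docs, k)))


def ensemble_predict(models, text):
    """Combine member outputs by majority vote (an assumed combination rule)."""
    votes = Counter(predict_one(m, text) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy labeled documents, invented purely for illustration.
    docs = [
        ("markets fall as bank shares slide", "finance"),
        ("team wins the final match", "sport"),
        ("bank raises interest rates as markets rally", "finance"),
        ("coach praises team after match", "sport"),
        ("bank shares and markets recover", "finance"),
        ("team loses match in final minute", "sport"),
    ]
    models = train_ensemble(docs, k=3)
    print(ensemble_predict(models, "bank markets"))
    print(ensemble_predict(models, "team match"))
```

Because each member sees only its own partition, training cost per member shrinks with the number of partitions, and the members can be trained independently before a final combination step.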
