An enhanced ACO algorithm to select features for text categorization and its parallelization

Feature selection is an indispensable preprocessing step for effective analysis of high-dimensional data. It removes irrelevant features, improves predictive accuracy, and increases the comprehensibility of models built by classifiers that are sensitive to features. Finding an optimal feature subset for a problem in a large domain is intractable, and many such feature selection problems have been shown to be NP-hard. Optimization algorithms are frequently designed for NP-hard problems to find nearly optimal solutions with a practical time complexity. This paper formulates text feature selection as a combinatorial problem and proposes an Ant Colony Optimization (ACO) algorithm to find a nearly optimal solution to it. The algorithm differs from the earlier algorithm by Aghdam et al. by including a statistics-based heuristic function and a local search. It aims to determine a solution that includes 'n' distinct features for each category. Optimization algorithms based on wrapper models show better results, but the processes involved in them are time intensive. The availability of parallel architectures, such as clusters of machines connected through fast Ethernet, has increased interest in the parallelization of algorithms. The proposed ACO algorithm was parallelized and demonstrated on a cluster of up to six machines. Documents from the 20 Newsgroups benchmark dataset were used for experimentation. Features selected by the proposed algorithm were evaluated using a Naive Bayes classifier and compared with standard feature selection techniques. The performance of the classifier improved with the features selected by the enhanced ACO and local search. The classifier's error decreased over iterations, and the number of positive features increased with the number of iterations.
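The general shape of an ACO feature selector with a heuristic function and a local search can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no update rules, so the function name `aco_select`, its parameters (`alpha`, `beta`, `rho`), the single-swap local search, and the best-so-far pheromone update are all assumptions chosen for illustration. The `heuristic` list stands in for a statistics-based score per feature, and `evaluate` stands in for a wrapper-style measure such as classifier accuracy.

```python
import random


def aco_select(heuristic, evaluate, n, ants=10, iterations=20,
               alpha=1.0, beta=1.0, rho=0.1, seed=0):
    """Illustrative ACO search for a subset of n feature indices.

    heuristic: one positive score per feature (hypothetical statistic).
    evaluate:  callable mapping a list of indices to a score to maximize.
    Requires n < len(heuristic) so the local search has a feature to swap in.
    """
    rng = random.Random(seed)
    m = len(heuristic)
    pheromone = [1.0] * m
    best, best_score = None, float("-inf")

    for _ in range(iterations):
        for _ in range(ants):
            # Construction: sample n distinct features, weighting each
            # candidate by pheromone^alpha * heuristic^beta.
            candidates = list(range(m))
            subset = []
            for _ in range(n):
                weights = [pheromone[f] ** alpha * heuristic[f] ** beta
                           for f in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(pick)
                subset.append(pick)
            score = evaluate(subset)

            # Local search (assumed form): swap one selected feature for
            # one unselected feature and keep the swap if it scores higher.
            out_pos = rng.randrange(n)
            swap_in = rng.choice(candidates)
            trial = subset[:out_pos] + [swap_in] + subset[out_pos + 1:]
            trial_score = evaluate(trial)
            if trial_score > score:
                subset, score = trial, trial_score

            if score > best_score:
                best, best_score = sorted(subset), score

        # Pheromone update: evaporate on every feature, then reinforce
        # the features of the best subset found so far.
        pheromone = [(1.0 - rho) * p for p in pheromone]
        for f in best:
            pheromone[f] += rho

    return best, best_score


if __name__ == "__main__":
    # Toy data: the scores are made up, and the evaluation simply sums them.
    toy_heuristic = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
    subset, score = aco_select(
        toy_heuristic, lambda s: sum(toy_heuristic[f] for f in s), n=3)
    print(subset, score)
```

In a wrapper setting, `evaluate` would train and score a classifier on the chosen features, which is the time-intensive step the abstract refers to. Because each ant's evaluation is independent of the others within an iteration, that step is a natural candidate for distribution across machines.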

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  T. Stützle,et al.  A Review on the Ant Colony Optimization Metaheuristic: Basis, Models and New Trends , 2002 .

[3]  Richard F. Hartl,et al.  An improved Ant System algorithm for the Vehicle Routing Problem , 1999, Ann. Oper. Res..

[4]  M. Dorigo,et al.  The Ant Colony Optimization Meta-Heuristic , 1999 .

[5]  J. Deneubourg,et al.  The self-organizing exploratory pattern of the argentine ant , 1990, Journal of Insect Behavior.

[6]  George Forman.  Feature Selection for Text Classification , 2007 .

[7]  B. Bullnheimer,et al.  A NEW RANK BASED VERSION OF THE ANT SYSTEM: A COMPUTATIONAL STUDY , 1997 .

[8]  Soon Myoung Chung,et al.  Text Clustering with Feature Selection by Using Statistical Data , 2008, IEEE Transactions on Knowledge and Data Engineering.

[9]  Daniel Merkle,et al.  Parallel Ant Colony Algorithms , 2005 .

[10]  Marco Dorigo,et al.  AntNet: Distributed Stigmergetic Control for Communications Networks , 1998, J. Artif. Intell. Res..

[11]  Dorothy Ndedi Monekosso,et al.  A review of ant algorithms , 2009, Expert Syst. Appl..

[12]  Alex Alves Freitas,et al.  Web Page Classification with an Ant Colony Algorithm , 2004, PPSN.

[13]  Marco Dorigo,et al.  Ant colony optimization , 2006, IEEE Computational Intelligence Magazine.

[14]  Luca Maria Gambardella,et al.  Ant colony system: a cooperative learning approach to the traveling salesman problem , 1997, IEEE Trans. Evol. Comput..

[15]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[16]  Marco Dorigo,et al.  Ant system for Job-shop Scheduling , 1994 .

[17]  Nasser Ghasem-Aghaee,et al.  A novel ACO-GA hybrid algorithm for feature selection in protein function prediction , 2009, Expert Syst. Appl..

[18]  Luca Maria Gambardella,et al.  A Study of Some Properties of Ant-Q , 1996, PPSN.

[19]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[20]  Hassan M. Emara,et al.  Using Ant Colony Optimization algorithm for solving project management problems , 2009, Expert Syst. Appl..

[21]  Thomas Stützle,et al.  MAX-MIN Ant System , 2000, Future Gener. Comput. Syst..

[22]  Nasser Ghasem-Aghaee,et al.  Text feature selection using ant colony optimization , 2009, Expert Syst. Appl..

[23]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[24]  Manuel López-Ibáñez,et al.  Ant colony optimization , 2010, GECCO '10.

[25]  Chris H. Q. Ding,et al.  Evolving Feature Selection , 2005, IEEE Intell. Syst..

[26]  Houkuan Huang,et al.  Feature selection for text classification with Naïve Bayes , 2009, Expert Syst. Appl..

[27]  Marco Dorigo,et al.  The ant colony optimization meta-heuristic , 1999 .

[28]  José Ranilla,et al.  Scoring and selecting terms for text categorization , 2005, IEEE Intelligent Systems.

[29]  Huan Liu,et al.  Feature Selection for Classification , 1997, Intell. Data Anal..