Parallel GA-based wrapper feature selection for spectroscopic data mining

Mining predictive models from dense databases is CPU-intensive and I/O-intensive. In this paper, we propose a taxonomy of existing techniques for achieving high performance, and a hybrid approach that combines four of them: feature selection, GA-based reduction of the exploration space, parallelism, and concurrency. The approach is evaluated on a near-infrared (NIR) spectroscopic application, which consists of predicting the concentration of a given component in a given product from its absorbance of NIR radiation. Statistical methods such as partial least squares (PLS) are well suited to and efficient for this data mining task. The experimental results show that preceding those methods with feature selection removes a significant number of irrelevant features and, at the same time, significantly improves the accuracy of the discovered predictive model. They also show that, for the considered task, the GA-based approach builds more accurate models than neural networks. Moreover, the parallel multithreaded implementation of the approach achieves a linear speed-up.
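
To make the idea of a GA-based wrapper concrete, here is a minimal sketch in Python. It is not the implementation evaluated in the paper. Every name in it (`ga_select`, `fitness`, `make_data`, and so on) is hypothetical, the data are synthetic rather than NIR spectra, and a small ridge-regularised least-squares fit stands in for PLS so that the example needs only the standard library. Each individual is a bit mask over the features, its fitness is the hold-out error of the model trained on the selected features, and the fitness of a whole population is evaluated through a thread pool.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ridge(rows, y, lam=1e-6):
    """Assumed stand-in for PLS: ridge-regularised least squares."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(a, b)


def fitness(mask, X, y, n_train):
    """Wrapper fitness: hold-out MSE of a model using only masked features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty feature subset is never acceptable
    rows = [[1.0] + [x[j] for j in cols] for x in X]  # leading 1.0 = intercept
    w = fit_ridge(rows[:n_train], y[:n_train])
    errs = [(sum(wi * ri for wi, ri in zip(w, r)) - t) ** 2
            for r, t in zip(rows[n_train:], y[n_train:])]
    return sum(errs) / len(errs)


def ga_select(X, y, n_train, pop_size=20, generations=15,
              p_mut=0.05, seed=0, workers=4):
    """Simple generational GA over feature bit masks (needs >= 2 features)."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    best_mask, best_err = None, float("inf")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Evaluate all individuals concurrently; map preserves order.
            errs = list(pool.map(lambda m: fitness(m, X, y, n_train), pop))
            ranked = sorted(zip(errs, pop), key=lambda p: p[0])
            if ranked[0][0] < best_err:
                best_err, best_mask = ranked[0][0], ranked[0][1][:]
            parents = [m for _, m in ranked[: pop_size // 2]]
            children = [ranked[0][1][:]]  # elitism: keep the current best
            while len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n_feat)  # one-point crossover
                child = a[:cut] + b[cut:]
                children.append([bit ^ (rng.random() < p_mut) for bit in child])
            pop = children
    return best_mask, best_err


def make_data(n=60, n_feat=12, seed=1):
    """Synthetic data: only features 0 and 3 carry signal (an assumption)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_feat)] for _ in range(n)]
    y = [2.0 * x[0] - 1.5 * x[3] + rng.gauss(0, 0.1) for x in X]
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    mask, err = ga_select(X, y, n_train=40)
    print("selected features:", [j for j, bit in enumerate(mask) if bit])
    print("hold-out MSE:", err)
```

The thread pool only illustrates where the parallelism sits, namely in the independent fitness evaluations of one generation. In CPython, pure-Python threads share the global interpreter lock, so this sketch should not be expected to reproduce the linear speed-up reported for the multithreaded implementation.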
