Logic classification and feature selection for biomedical data

In this paper we investigate logic classification and related feature selection algorithms for large biomedical data sets. When the data are in binary (logic) form, feature selection can be formulated as a Set Covering problem of very large dimensions, whose solution is computationally challenging. We propose an alternative, approximate formulation of feature selection that yields a compact extension of Set Covering, and we use the logic classifier Lsquare to test its performance on two well-known data sets. An ad hoc metaheuristic of the GRASP type is used to solve the feature selection problem efficiently. We also describe a simple and effective method for converting rational data into logic data by interval mapping. The computational results are promising, and a major strength of the approach is that the resulting logic models can be easily understood and integrated with other domain knowledge.
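
The ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the plain (large) Set Covering formulation, not the compact extension the paper proposes, and it does not involve Lsquare. The function names (`to_logic`, `build_cover_rows`, `grasp_cover`), the cutpoint encoding, the parameter `alpha`, and the toy data are all assumptions made for illustration. Each rational value is mapped to logic features by thresholds, one covering constraint is built per pair of samples from opposite classes, and a GRASP-style loop (greedy randomized construction followed by a simple redundancy-removal local search) looks for a small set of separating features.

```python
import random
from itertools import product

def to_logic(row, cutpoints):
    """Interval mapping (assumed encoding): one binary feature per (column, threshold)."""
    return tuple(int(row[col] > thr) for col, thr in cutpoints)

def build_cover_rows(pos, neg):
    """One constraint per (positive, negative) pair: the logic features that separate them."""
    return [frozenset(k for k in range(len(p)) if p[k] != n[k])
            for p, n in product(pos, neg)]

def grasp_cover(rows, n_features, iters=50, alpha=0.3, seed=0):
    """GRASP-style heuristic for Set Covering; returns a sorted list of feature indices."""
    if any(not r for r in rows):
        raise ValueError("two samples of opposite classes are identical in logic form")
    rng = random.Random(seed)
    best = None
    for _ in range(iters):
        uncovered, chosen = list(rows), set()
        # Greedy randomized construction with a restricted candidate list (RCL).
        while uncovered:
            gain = {k: sum(k in r for r in uncovered)
                    for k in range(n_features) if k not in chosen}
            top = max(gain.values())
            rcl = [k for k, g in gain.items() if g > 0 and g >= (1 - alpha) * top]
            k = rng.choice(rcl)
            chosen.add(k)
            uncovered = [r for r in uncovered if k not in r]
        # Local search: drop any feature whose removal keeps every pair separated.
        for k in sorted(chosen):
            trial = chosen - {k}
            if all(r & trial for r in rows):
                chosen = trial
        if best is None or len(chosen) < len(best):
            best = chosen
    return sorted(best)

# Toy data (invented): two rational measurements per sample, two classes.
cutpoints = [(0, 5.5), (1, 0.25), (1, 0.35)]
pos = [to_logic(r, cutpoints) for r in [(5.1, 0.2), (4.9, 0.4)]]
neg = [to_logic(r, cutpoints) for r in [(6.3, 0.3), (6.0, 0.1)]]
rows = build_cover_rows(pos, neg)
selected = grasp_cover(rows, n_features=len(cutpoints))
print(selected)
```

The number of covering constraints grows with the product of the class sizes, which is the dimensionality issue the abstract refers to; the sketch makes no attempt to address it.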
