Feature Cluster Selection for High-Throughput Data Analysis

Although feature selection has proven effective for sample class prediction, it is not adequate for identifying leads for potentially useful biomarkers in high-throughput biological data analysis. The large number of equally good predictive feature sets, and the disparity among them, reveal the gap between feature selection and biomarker identification. We propose to bridge this gap with a new data mining task, feature cluster selection, which aims to select all relevant features in a data set and group them into a small number of coherent clusters. We provide both a theoretical framework and an empirical formulation for the new problem, and propose the 3M algorithm to solve it. Experiments on microarray data show that the algorithm can select highly predictive representative gene sets and discover statistically significant gene clusters.
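
To make the task concrete, the following is a minimal illustrative sketch of feature cluster selection on toy data. It is not the 3M algorithm, whose details are not given here. It assumes a simple two-step scheme: keep features whose absolute Pearson correlation with the class label reaches a relevance threshold, then greedily group the kept features around representatives by pairwise correlation. The function names, the thresholds (`min_relevance`, `min_coherence`), and the gene names and values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def feature_cluster_selection(features, labels, min_relevance=0.5, min_coherence=0.9):
    """Illustrative only (assumed scheme, not the 3M algorithm).

    features: dict mapping feature name -> list of values, one per sample
    labels:   list of 0/1 class labels, one per sample
    Returns a list of clusters; the first name in each cluster is its representative.
    """
    # Step 1: score relevance as |correlation with the class label|
    # and keep only features at or above the threshold, most relevant first.
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    relevant = sorted(
        (name for name in relevance if relevance[name] >= min_relevance),
        key=lambda name: (-relevance[name], name),
    )

    # Step 2: greedy grouping. A feature joins the first cluster whose
    # representative it correlates with strongly; otherwise it starts a new cluster.
    clusters = []
    for name in relevant:
        for cluster in clusters:
            representative = cluster[0]
            if abs(pearson(features[name], features[representative])) >= min_coherence:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical toy expression data: six samples, three per class.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "gene_a": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # relevant, co-varies with gene_b
    "gene_b": [2.1, 2.3, 2.0, 6.1, 6.0, 5.9],  # relevant, co-varies with gene_a
    "gene_c": [0.0, 2.0, 1.0, 2.0, 4.0, 3.0],  # relevant, but a distinct pattern
    "gene_d": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # uncorrelated with the label
}

clusters = feature_cluster_selection(features, labels)
print(clusters)
```

On this toy input the sketch keeps three of the four hypothetical genes and groups the two co-varying ones together, which mirrors the stated goal: retain all relevant features rather than a single minimal predictive subset, and organize them into a few coherent clusters with one representative each.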
