Mean-entropy discretized features are effective for classifying high-dimensional biomedical data

This paper studies an empirical feature selection heuristic for classifying high-dimensional biomedical data. A feature's discriminating power can be measured by its entropy under supervised, entropy-based discretization. Based on this idea, we first discard the features that the discretization method ignores, i.e., those for which no informative cut point is found. This step alone usually reduces the dimensionality of the data by 90-95%. We then rank the remaining features by entropy and select those whose entropy is smaller than the mean entropy of all remaining features; this second round usually removes a further two thirds of the features. In this way, we reduce tens of thousands of features to only hundreds of important ones. Furthermore, we observe that learning algorithms, including our new tree-committee classifier, generally improve in accuracy after this feature selection. The heuristic appears more systematic than the prevailing practice of classifying with a fixed number of top-ranked features.
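The two-round selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a single best binary split per feature as a stand-in for the multi-interval MDL discretization of Fayyad and Irani, and all function names are our own.

```python
import numpy as np

def class_entropy(labels):
    """Shannon entropy (bits) of a class-label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_entropy(values, labels):
    """Weighted class entropy of the best binary split on one feature.

    Returns None when no cut point lowers entropy below the unsplit
    baseline, mimicking a feature 'ignored' by entropy-based
    discretization (round 1 of the heuristic).
    """
    order = np.argsort(values)
    v, y = values[order], labels[order]
    n = len(y)
    baseline = class_entropy(y)
    best = None
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue  # no cut point between identical values
        e = (i / n) * class_entropy(y[:i]) + ((n - i) / n) * class_entropy(y[i:])
        if best is None or e < best:
            best = e
    if best is None or best >= baseline:
        return None
    return best

def mean_entropy_select(X, y):
    """Round 1: drop features with no informative cut point.
    Round 2: keep features whose entropy is below the mean entropy
    of the surviving features."""
    entropies = {}
    for j in range(X.shape[1]):
        e = best_split_entropy(X[:, j], y)
        if e is not None:
            entropies[j] = e
    mean_e = np.mean(list(entropies.values()))
    return sorted(j for j, e in entropies.items() if e < mean_e)
```

On a toy matrix where one feature perfectly separates the classes, one is weakly informative, and one is constant, the constant feature is dropped in round 1 and only the perfect separator falls below the mean entropy in round 2.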
