Weighted random subspace method for high dimensional data classification.

High dimensional data, especially those emerging from genomics and proteomics studies, pose significant challenges to traditional classification algorithms because the performance of these algorithms may substantially deteriorate due to high dimensionality and existence of many noisy features in these data. To address these problems, pre-classification feature selection and aggregating algorithms have been proposed. However, most feature selection procedures either fail to consider potential interactions among the features or tend to over fit the data. The aggregating algorithms, e.g. the bagging predictor, the boosting algorithm, the random subspace method, and the Random Forests algorithm, are promising in handling high dimensional data. However, there is a lack of attention to optimal weight assignments to individual classifiers and this has prevented these algorithms from achieving better classification accuracy. In this article, we formulate the weight assignment problem and propose a heuristic optimization solution.We have applied the proposed weight assignment procedures to the random subspace method to develop a weighted random subspace method. Several public gene expression and mass spectrometry data sets at the Kent Ridge biomedical data repository have been analyzed by this novel method. We have found that significant improvement over the common equal weight assignment scheme may be achieved by our method.

[1]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[2]  D. Gerhold,et al.  Monitoring expression of genes involved in drug metabolism and toxicology using DNA microarrays. , 2001, Physiological genomics.

[3]  E. Kleinberg An overtraining-resistant stochastic modeling method for pattern recognition , 1996 .

[4]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[5]  Gabriele Steidl,et al.  Combined SVM-Based Feature Selection and Classification , 2005, Machine Learning.

[6]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[7]  E. M. Kleinberg,et al.  Stochastic discrimination , 1990, Annals of Mathematics and Artificial Intelligence.

[8]  L. Breiman Heuristics of instability and stabilization in model selection , 1996 .

[9]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  David Ward,et al.  Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data , 2003, Bioinform..

[11]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[12]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[13]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[14]  E. Petricoin,et al.  Use of proteomic patterns in serum to identify ovarian Cancer , 2002 .