Redundancy-based feature selection for high-dimensional data and application in bioinformatics

This dissertation studies feature selection: the problem of selecting a subset of features from the original features in a data set. In many applications, such as genomic microarray analysis and text document categorization, data often contain thousands of features, many of which are irrelevant or redundant for the classification task. For learning algorithms to perform efficiently and effectively on high-dimensional data, it is imperative to remove irrelevant and redundant features. The characteristics of such high-dimensional data hinder the success of many applications and pose severe challenges for existing feature selection methods that focus on feature relevance analysis. This dissertation describes solutions for efficiently handling redundant features. First, it shows that feature relevance alone is insufficient for efficient and effective feature selection on high-dimensional data. It then proposes a systematic framework for performing explicit redundancy analysis in feature selection. Under this framework, it introduces three feature selection algorithms for different types of high-dimensional data: Fast Correlation Based Filter (FCBF), Redundancy Based Filter (RBF), and Reporter Surrogate Variable Program (RSVP). FCBF enables efficient selection of relevant but non-redundant features. Applied to gene expression microarray data, RBF efficiently identifies a small set of discriminative genes for accurate classification of biological samples. On a real-world problem of glioma migration, the dissertation discusses results from RSVP, which bridges the gap between statistically significant findings and biologically significant insights.
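To make the relevance-plus-redundancy idea concrete, the following is a minimal Python sketch in the spirit of FCBF, using symmetrical uncertainty as the correlation measure and an approximate Markov blanket test for redundancy. It is an illustration under simplifying assumptions (discrete-valued features, no optimizations), not the dissertation's exact implementation; the function names and the `delta` threshold are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy
    ig = hx + hy - hxy                  # information gain (mutual information)
    return 0.0 if hx + hy == 0 else 2.0 * ig / (hx + hy)

def fcbf_style_selection(X, y, delta=0.0):
    """FCBF-style sketch: keep relevant features, drop redundant ones.

    X: (n_samples, n_features) array of discrete feature values
    y: (n_samples,) array of class labels
    delta: relevance threshold on SU(feature, class)
    """
    n_features = X.shape[1]
    # 1. Relevance analysis: keep features whose SU with the class exceeds delta,
    #    ranked in decreasing order of that correlation.
    su_class = [symmetrical_uncertainty(X[:, j], y) for j in range(n_features)]
    candidates = [j for j in range(n_features) if su_class[j] > delta]
    candidates.sort(key=lambda j: su_class[j], reverse=True)
    # 2. Redundancy analysis: a candidate is removed if some higher-ranked,
    #    already-selected feature is more correlated with it than it is with
    #    the class (an approximate Markov blanket).
    selected = []
    for j in candidates:
        if all(symmetrical_uncertainty(X[:, s], X[:, j]) < su_class[j]
               for s in selected):
            selected.append(j)
    return selected
```

Because each feature is compared only against the small set of already-selected features rather than against all pairs, this style of explicit redundancy analysis stays efficient even when the data contain thousands of features.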