A Between-Class Overlapping Filter-Based Method for Transcriptome Data Analysis

Feature selection algorithms play a crucial role in identifying important genes for cancer classification. They fall broadly into two groups: filter-based methods and wrapper-based methods. Filter-based methods are popular in the literature because they are computationally efficient, architecturally simple, and offer an intuitive route to biological and clinical insight. However, these methods have limitations, and the genes they select often yield limited classification accuracy. In this paper, we propose a set of univariate filter-based methods that use a between-class overlapping criterion. We compare the proposed techniques with many other univariate filter-based methods on an acute leukemia dataset, examining the following properties: classification accuracy of the selected individual genes and of the gene subsets; redundancy among the selected genes, checked using ridge regression and LASSO; similarity and sensitivity analyses; functional analysis; and stability analysis. A comprehensive experiment shows promising results for the proposed techniques. The univariate filter-based methods using the between-class overlapping criterion are accurate and robust, have biological significance, and are computationally efficient and easy to implement. They are therefore well suited to biological and clinical discovery.
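
The abstract does not define the between-class overlapping criterion, so the following is only an illustrative sketch of how a univariate overlap-based filter could work, not the authors' method. It assumes two classes and scores each gene by the fraction of samples whose expression value falls inside the interval where the two classes' value ranges overlap; genes with less overlap are ranked first. The names `overlap_score` and `rank_genes`, the gene names, and the toy expression values are all hypothetical.

```python
def overlap_score(values, labels):
    """Fraction of samples inside the overlap of the two classes' ranges.

    Assumed criterion for illustration only: 0.0 means the class ranges
    are disjoint, 1.0 means every sample lies in the overlapping interval.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a = [v for v, y in zip(values, labels) if y == classes[0]]
    b = [v for v, y in zip(values, labels) if y == classes[1]]
    lo = max(min(a), min(b))   # lower edge of the overlapping interval
    hi = min(max(a), max(b))   # upper edge of the overlapping interval
    if lo > hi:                # ranges do not overlap at all
        return 0.0
    inside = sum(1 for v in values if lo <= v <= hi)
    return inside / len(values)


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names with the smallest overlap score.

    expression maps a gene name to one expression value per sample.
    Each gene is scored independently (a univariate filter).
    """
    scored = sorted(expression.items(),
                    key=lambda item: overlap_score(item[1], labels))
    return [name for name, _ in scored[:top_k]]


# Made-up toy data: three samples per class, three hypothetical genes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.8, 3.4],  # class ranges are disjoint
    "gene_b": [1.0, 2.0, 3.0, 2.5, 3.5, 4.5],  # ranges overlap partly
    "gene_c": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # ranges overlap heavily
}

selected = rank_genes(expression, labels, top_k=2)
print(selected)
```

Because each gene is scored on its own, the cost grows linearly with the number of genes, which is the kind of computational efficiency the abstract attributes to filter-based methods. A range-based score like this one is sensitive to outliers; a practical criterion would likely need a more robust definition of overlap.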
