A feature selection method for classification within functional genomics experiments based on the proportional overlapping score

Background: Microarray technology, like other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier can be enhanced by performing the classification on selected discriminative genes only. We propose a statistical method for selecting genes based on an analysis of how their expression values overlap across classes. This method yields a novel measure of a feature's relevance to a classification task, called the proportional overlapping score (POS).

Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. Classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves better performance.

Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the overlap of expression values across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene, which minimizes the effect of expression outliers. The constructed masks, along with a novel gene score, are used to produce the selected subset of genes.
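The idea of scoring a gene by the proportion of samples that fall where the classes' expression ranges overlap can be illustrated with a minimal sketch. This is not the paper's POS definition, which the abstract does not spell out: the function names `core_interval` and `overlap_fraction`, the two-class setting, and the quartile-based whisker rule used to ignore outliers are all assumptions made for illustration.

```python
from statistics import quantiles


def core_interval(values):
    """Range of one class's expression values after dropping extreme points.

    Assumption: outliers are values beyond 1.5 interquartile ranges from
    the quartiles (a boxplot-style rule, chosen here for illustration).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    inside = [v for v in values if q1 - spread <= v <= q3 + spread]
    return min(inside), max(inside)


def overlap_fraction(class_a, class_b):
    """Fraction of all samples lying in the region shared by both classes.

    A lower value suggests a more discriminative gene under this
    hypothetical score.
    """
    lo_a, hi_a = core_interval(class_a)
    lo_b, hi_b = core_interval(class_b)
    lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
    if lo > hi:
        # The two core intervals do not intersect.
        return 0.0
    samples = list(class_a) + list(class_b)
    return sum(lo <= v <= hi for v in samples) / len(samples)


# Two classes sharing the range [3, 4]: half of the samples overlap.
print(overlap_fraction([1, 2, 3, 4], [3, 4, 5, 6]))

# One outlier (100) in the first class does not create an overlap.
print(overlap_fraction([1, 2, 3, 4, 100], [10, 11, 12, 13, 14]))
```

In the second call, a plain minimum-to-maximum range for the first class would span 1 to 100 and swallow the second class entirely. Trimming extreme values before building the interval is one simple way to get the kind of outlier robustness the abstract describes.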
