RIFS: a randomly restarted incremental feature selection algorithm

The advent of the big data era has imposed both running time and learning efficiency challenges on machine learning researchers. Biomedical OMIC research is one such big data area and has changed biomedical research drastically. However, the high cost of data production and the difficulty of participant recruitment bring the “large p small n” paradigm into biomedical research. Feature selection is usually employed to reduce the large number of biomedical features, so that a stable data-independent classification or regression model may be achieved. This study randomly changes the first element of the widely used incremental feature selection (IFS) strategy and selects the best feature subset, which may contain features ranked low by statistical association evaluation algorithms, e.g. the t-test. The hypothesis is that two low-ranked features may be orchestrated to achieve good classification performance. The proposed Randomly re-started Incremental Feature Selection (RIFS) algorithm demonstrates both higher classification accuracy and a smaller number of features than the existing algorithms. RIFS also outperforms the existing methylomic diagnosis model for prostate malignancy, with a higher accuracy and a smaller number of transcriptomic features.
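The idea of restarting IFS from a randomly chosen rank can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the leave-one-out evaluation, the `patience` stopping rule, the number of restarts, and all function names are assumptions introduced here for illustration only. Features are ranked by a Welch t-statistic, and each IFS run adds consecutively ranked features starting from a given rank.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic per feature for binary labels y in {0, 1}."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Stand-in classifier (an assumption); needs at least two samples per class.
    """
    hits = 0
    for i in range(len(y)):
        mask = np.ones(len(y), dtype=bool)
        mask[i] = False
        Xtr, ytr = X[mask], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += int(pred == y[i])
    return hits / len(y)

def ifs_from(X, y, order, start, patience=3):
    """IFS beginning at rank `start`: add consecutively ranked features.

    Stops after `patience` additions without improvement (hypothetical rule).
    """
    best_acc, best_subset, misses = -1.0, [], 0
    for end in range(start + 1, len(order) + 1):
        subset = order[start:end]
        acc = loo_accuracy(X[:, subset], y)
        if acc > best_acc:
            best_acc, best_subset, misses = acc, [int(j) for j in subset], 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best_acc, best_subset

def rifs(X, y, n_restarts=10, patience=3, seed=0):
    """Run IFS from rank 0 and from random ranks; keep the best subset."""
    rng = np.random.default_rng(seed)
    order = np.argsort(-t_scores(X, y))  # best-ranked feature first
    starts = {0} | set(rng.integers(0, len(order), size=n_restarts).tolist())
    best = (-1.0, [])
    for s in sorted(starts):
        acc, subset = ifs_from(X, y, order, s, patience)
        # Prefer higher accuracy; break ties in favour of fewer features.
        if acc > best[0] or (acc == best[0] and len(subset) < len(best[1])):
            best = (acc, subset)
    return best
```

Calling `rifs(X, y)` on a samples-by-features matrix `X` and a 0/1 label vector `y` returns the best accuracy found and the corresponding feature indices. Because rank 0 is always among the starting points in this sketch, the result is never worse than plain IFS under the same evaluation.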
