Feature selection and classification for high-dimensional biological data under cross-validation framework

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has opened up new insights but has also posed analytical challenges. One important problem is selecting an informative feature subset and predicting future outcomes. We propose a two-step framework for feature selection and classification model construction that uses nested and repeated cross-validation. We evaluated the approach on both simulated data and publicly available gene expression datasets. The proposed method showed better predictive accuracy for new cases than the K-fold cross-validation method.
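
The abstract does not specify the selector, the classifier, or the fold counts, so the following is only a minimal sketch of what nested and repeated cross-validation can look like in general. It assumes a simple filter score for feature ranking and a nearest-centroid classifier as stand-ins; every function name, the synthetic data generator, and all parameter values are hypothetical and are not taken from the paper. The inner loop picks the number of features using training rows only, and the outer loop estimates accuracy on rows that were never used for selection.

```python
import random
import statistics

def make_data(n=40, p=50, informative=3, shift=2.0, seed=1):
    """Synthetic two-class data (assumption): the first `informative` features carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * y[i] if j < informative else 0.0, 1.0) for j in range(p)]
         for i in range(n)]
    return X, y

def make_folds(n, k, rng):
    """Shuffle positions 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rank_features(X, y, rows):
    """Rank features by |difference of class means| / overall std dev (a simple filter score).
    Assumes both classes are present among `rows`."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        spread = statistics.pstdev(a + b) or 1.0  # guard against zero spread
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def accuracy(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train` and score it on `test`."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in train if y[i] == c]
        cents[c] = [statistics.fmean([X[i][j] for i in rows]) for j in feats]
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c])) for c in cents}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test)

def nested_repeated_cv(X, y, sizes=(2, 5, 10), outer_k=5, inner_k=4, repeats=3, seed=0):
    """Return (mean outer accuracy, list of per-fold outer accuracies)."""
    rng = random.Random(seed)
    outer_scores = []
    for _ in range(repeats):                      # repeat with a fresh random partition
        for held_out in make_folds(len(y), outer_k, rng):
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            inner_folds = make_folds(len(train), inner_k, rng)
            best_m, best_acc = sizes[0], -1.0
            for m in sizes:                       # inner loop: tune the subset size
                accs = []
                for fold in inner_folds:
                    val = [train[pos] for pos in fold]
                    val_set = set(val)
                    fit = [i for i in train if i not in val_set]
                    feats = rank_features(X, y, fit)[:m]   # selection sees `fit` rows only
                    accs.append(accuracy(X, y, fit, val, feats))
                mean_acc = statistics.fmean(accs)
                if mean_acc > best_acc:
                    best_m, best_acc = m, mean_acc
            # Refit the selection on all outer-training rows, then test on held-out rows.
            feats = rank_features(X, y, train)[:best_m]
            outer_scores.append(accuracy(X, y, train, held_out, feats))
    return statistics.fmean(outer_scores), outer_scores

if __name__ == "__main__":
    X, y = make_data()
    mean_acc, scores = nested_repeated_cv(X, y)
    print(f"mean outer accuracy over {len(scores)} folds: {mean_acc:.3f}")
```

The key design point in this sketch is that feature ranking is recomputed inside every training split, so held-out rows never influence which features are chosen. Ranking features once on the full dataset before splitting would leak information into the accuracy estimate.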
