Leukemia and small round blue-cell tumor cancer detection using microarray gene expression data set: Combining data dimension reduction and variable selection technique

Abstract Gene expression data play an important role in cancer classification and in addressing fundamental problems of cancer diagnosis. Because gene expression data are high-throughput, with a very large number of genes measured for both healthy and patient samples, a variable selection method can be applied to reduce model complexity and improve classification performance. However, variable selection procedures risk over-fitting when the number of variables is large relative to the number of samples. In the present study, we therefore propose a method that couples data dimension reduction with variable selection. The approach first clusters the variables of the original data set. A local principal component analysis (PCA) model is then built for each cluster, and only its significant components are retained. Finally, the variable selection algorithm is run on these locally derived principal component variables. The proposed algorithm was evaluated on two gene expression data sets: acute leukemia and small round blue-cell tumor (SRBCT). The classification models built on the reduced data performed better than those built on the entire microarray gene expression profile.
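The three steps described in the abstract (variable clustering, local PCA per cluster, variable selection on the retained components) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the clustering algorithm, the criterion for a "significant" component, or the selection algorithm. The sketch assumes plain k-means on standardized variable profiles, a cumulative explained-variance threshold (`var_threshold`, a hypothetical parameter), and a simple two-class Fisher-ratio ranking as a stand-in for the selection step. All function names and the synthetic data are invented for the example.

```python
import numpy as np

def cluster_variables(X, n_clusters, n_iter=20, seed=0):
    """Group the columns of X with plain k-means (assumed clustering method)."""
    rng = np.random.default_rng(seed)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    V = Z.T  # one row per variable
    centers = V[rng.choice(V.shape[0], n_clusters, replace=False)]
    for _ in range(n_iter):
        # squared distance of every variable to every cluster center
        d = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels

def local_pca_scores(X, labels, var_threshold=0.9):
    """Run PCA inside each variable cluster and keep the leading components
    needed to reach var_threshold cumulative explained variance (assumed
    definition of 'significant')."""
    blocks = []
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        Xk = Xk - Xk.mean(axis=0)
        U, s, _ = np.linalg.svd(Xk, full_matrices=False)
        explained = np.cumsum(s ** 2) / np.sum(s ** 2)
        n_keep = min(int(np.searchsorted(explained, var_threshold)) + 1, len(s))
        blocks.append(U[:, :n_keep] * s[:n_keep])  # PC scores of this cluster
    return np.hstack(blocks)

def select_components(T, y, n_select):
    """Rank score columns by a two-class Fisher ratio (stand-in selector)."""
    a, b = T[y == 0], T[y == 1]
    ratio = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (
        a.var(axis=0) + b.var(axis=0) + 1e-12
    )
    return np.argsort(ratio)[::-1][:n_select]

# Synthetic stand-in for a microarray matrix: 40 samples, 200 "genes",
# with a class-dependent shift in the first 20 genes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, :20] += 2.0

labels = cluster_variables(X, n_clusters=5)
scores = local_pca_scores(X, labels, var_threshold=0.9)
selected = select_components(scores, y, n_select=3)
print(X.shape, "->", scores.shape, "selected columns:", selected)
```

In a real analysis the clustering, PCA, and selection steps would all need to sit inside the cross-validation loop; fitting them on the full data before validation would reintroduce the over-fitting risk the method is meant to reduce.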
