Rasch-based high-dimensionality data reduction and class prediction with applications to microarray gene expression data

Class prediction is an important application of microarray gene expression data analysis. The high dimensionality of microarray data, where the number of genes (variables) is very large compared to the number of samples (observations), makes many prediction techniques (e.g., logistic regression, discriminant analysis) difficult to apply. An efficient way to address this problem is to use statistical dimension reduction techniques. Increasingly used in psychology-related applications, the Rasch model (RM) provides an appealing framework for handling high-dimensional microarray data. In this paper, we study the potential of RM-based modeling for dimensionality reduction with binarized microarray gene expression data and investigate its prediction accuracy in the context of class prediction using linear discriminant analysis. Two publicly available microarray data sets are used to illustrate a general framework of the approach. The performance of the proposed method is assessed with a re-randomization scheme, using principal component analysis (PCA) as a benchmark method. Our results show that RM-based dimension reduction is as effective as PCA-based dimension reduction. The method is general and can be applied to other high-dimensional data problems.
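To make the idea of RM-based reduction of binarized expression data concrete, the following is a minimal sketch under stated assumptions, not the procedure used in the paper. The abstract does not specify the binarization rule, the estimation method, or how multiple latent scores are obtained, so the per-gene median threshold, the ridge-stabilized joint maximum likelihood fit by gradient ascent, the split of genes into contiguous groups, and all function names (`binarize_by_gene_median`, `fit_rasch_abilities`, `rasch_reduce`) are hypothetical choices made for illustration. The sketch uses the standard Rasch form, in which the probability that sample i has gene j "on" is the logistic function of theta_i - beta_j.

```python
import numpy as np


def binarize_by_gene_median(X):
    """Return a 0/1 matrix: 1 where a value exceeds its gene's (column's) median.

    Assumption: the median threshold is illustrative only.
    """
    return (X > np.median(X, axis=0, keepdims=True)).astype(float)


def fit_rasch_abilities(B, n_iter=200, lr=0.5, ridge=0.01):
    """Estimate sample abilities theta under a Rasch model.

    Model: P(B[i, j] = 1) = sigmoid(theta[i] - beta[j]).
    Assumption: joint maximum likelihood by plain gradient ascent, with a
    small ridge penalty so that extreme rows or columns stay finite.
    """
    n_samples, n_genes = B.shape
    theta = np.zeros(n_samples)  # sample ("person") abilities
    beta = np.zeros(n_genes)     # gene ("item") difficulties
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
        resid = B - P
        # Gradient of the log-likelihood: +sum_j resid for theta_i,
        # -sum_i resid for beta_j (both rescaled to per-entry averages).
        theta += lr * (resid.sum(axis=1) / n_genes - ridge * theta)
        beta += lr * (-resid.sum(axis=0) / n_samples - ridge * beta)
        beta -= beta.mean()  # fix the location of the latent scale
    return theta


def rasch_reduce(X, n_groups=3):
    """Reduce a (samples x genes) matrix to n_groups latent scores per sample.

    Assumption: genes are split into contiguous groups and one Rasch model
    is fitted per group; any other grouping could be substituted.
    """
    B = binarize_by_gene_median(X)
    groups = np.array_split(np.arange(B.shape[1]), n_groups)
    return np.column_stack([fit_rasch_abilities(B[:, g]) for g in groups])
```

Calling `rasch_reduce(X, n_groups=3)` on a samples-by-genes array returns one row of three latent scores per sample, which could then be passed to a classifier such as linear discriminant analysis in place of PCA scores. Because the raw score is a sufficient statistic for ability in the Rasch model, a single fitted model yields essentially a one-dimensional summary, which is why this sketch fits one model per gene group.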
