On the classification of microarray gene-expression data

We consider the classification of microarray gene-expression data. First, attention is given to the supervised case, in which the tissue samples are classified with respect to a number of predefined classes and the aim is to assign a new, unclassified tissue sample to one of these classes. The problems of forming a classifier and estimating its error rate are addressed in the setting where the number of observations (tissue samples) is small relative to the number of variables (the genes, which can number in the tens of thousands). We then proceed to the unsupervised case and consider the clustering of the tissue samples and also the clustering of the gene profiles. Both clustering problems can be viewed as non-standard in statistics, and we address some of the key issues involved. The focus is on the use of mixture models to effect the clustering in both problems.
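As a rough illustration of what clustering with a mixture model involves, the sketch below fits a two-component normal mixture by the EM algorithm to simulated expression values for a single hypothetical gene, then assigns each tissue sample to the component with the larger posterior probability. The data, the function names, and the univariate setting are all assumptions made for illustration; this is not the method developed in the paper, which has to cope with thousands of genes at once.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(values, n_iter=200):
    """EM for a univariate two-component normal mixture (illustrative only)."""
    n = len(values)
    # Crude starting values: means at the extremes, common variance.
    means = [min(values), max(values)]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing proportions, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against a collapsing component
    return weights, means, variances, resp


# Simulated data (an assumption): 20 samples from each of two well-separated groups.
random.seed(0)
values = [random.gauss(-2.0, 0.5) for _ in range(20)] + [random.gauss(2.0, 0.5) for _ in range(20)]

weights, means, variances, resp = fit_two_component_mixture(values)

# Outright clustering: assign each sample to its most probable component.
labels = [0 if r[0] >= r[1] else 1 for r in resp]
```

The posterior probabilities in `resp` give a soft (probabilistic) clustering, and `labels` is the hard assignment derived from it. With far more genes than tissue samples, fitting unrestricted component covariance matrices is not feasible, which is one reason the high-dimensional case needs more structured mixture models than this sketch.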
