Evaluation of Clustering Based on Preprocessing in Gene Expression Data

Microarrays have become effective, widely used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing these complex data and organizing them into meaningful information, and one of the main goals in analyzing gene expression data is to detect samples or genes with similar expression patterns. In this paper, we describe several clustering methods and compare their performance under different data preprocessing strategies, including normalization and noise removal. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results indicate that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise in the datasets.

Keywords: Gene expression, Clustering, Data preprocessing.
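To make the kind of comparison described above concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a hypothetical simulation (two sample groups, 50 differentially expressed genes, and an array-wide intensity offset on every other array), uses per-array median centering as one simple stand-in for normalization, and scores a plain two-cluster k-means by its agreement with the known group labels. All function names, parameter values, and the choice of agreement score are illustrative assumptions.

```python
import numpy as np

def simulate_expression(n_genes=200, n_per_group=10, seed=0):
    """Hypothetical simulation: two sample groups plus a technical array offset."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_group)
    data = rng.normal(0.0, 0.5, size=(2 * n_per_group, n_genes))
    data[labels == 1, :50] += 2.0  # first 50 genes are shifted in group 1
    data[::2] += 3.0               # every other array carries an intensity offset
    return data, labels

def median_center(data):
    """Subtract each array's median log-intensity (one simple normalization)."""
    return data - np.median(data, axis=1, keepdims=True)

def kmeans_two_clusters(data, n_iter=20):
    """Plain Lloyd's k-means with k=2 and a deterministic farthest-point start."""
    first = data[0]
    second = data[np.argmax(((data - first) ** 2).sum(axis=1))]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to both centers.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = np.stack([
            data[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in (0, 1)
        ])
    return assign

def agreement(assign, labels):
    """Fraction of samples matching the true groups, up to swapping cluster names."""
    match = float(np.mean(assign == labels))
    return max(match, 1.0 - match)

data, labels = simulate_expression()
raw_score = agreement(kmeans_two_clusters(data), labels)
norm_score = agreement(kmeans_two_clusters(median_center(data)), labels)
print(f"agreement without normalization: {raw_score:.2f}")
print(f"agreement with median centering: {norm_score:.2f}")
```

In this construction the array offset is much larger than the biological signal, so clustering the raw values tends to group arrays by offset rather than by sample group, while median centering removes the offset before clustering. Agreement with known labels is only usable on simulated data; for real data an internal validation measure would be needed instead.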
