Hierarchical cluster analysis of SAGE data for cancer profiling

In this paper we present a method for clustering SAGE (Serial Analysis of Gene Expression) data to detect similarities and dissimilarities between different types of cancer at the sub-cellular level. SAGE data is extremely high-dimensional, and the measurement method introduces many errors and missing values, which challenges any clustering algorithm. We therefore introduce special preprocessing techniques to reduce these errors and to restore missing data. These techniques are tailored to the process that generates the data and make only very conservative changes. Furthermore, we present a new subspace selection technique that uses the Wilcoxon test to identify a relevant subset of attributes (genes). This general technique can be applied to select subspaces for clustering whenever some high-level categories of interest (such as cancerous and non-cancerous) are known for the data. Finally, we discuss the results of applying the clustering algorithm OPTICS to the SAGE data before and after our preprocessing steps.
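
The idea of selecting genes with a rank test can be illustrated with a minimal sketch. It is not the procedure from the paper, whose details the abstract does not give. It assumes that each library carries a binary label (cancerous or not), that each gene is tested separately with a two-sided Wilcoxon rank-sum test, and that genes with a p-value below a hypothetical threshold `alpha` are kept. The p-value uses the normal approximation without a tie correction, which is only rough for small groups. The names `average_ranks`, `rank_sum_p_value`, and `select_genes` are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_value(group_a, group_b):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])  # rank sum of the first group
    mean = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(profiles, labels, alpha=0.05):
    """Return indices of genes whose expression differs between the two classes.

    profiles: one expression vector per library (assumed layout).
    labels:   True for cancerous libraries, False otherwise (assumed encoding).
    alpha:    hypothetical significance threshold.
    """
    selected = []
    for g in range(len(profiles[0])):
        cancer = [p[g] for p, is_cancer in zip(profiles, labels) if is_cancer]
        normal = [p[g] for p, is_cancer in zip(profiles, labels) if not is_cancer]
        if rank_sum_p_value(cancer, normal) < alpha:
            selected.append(g)
    return selected
```

The selected indices would then define the subspace passed to the clustering step. In practice one would probably also correct for testing many genes at once, which this sketch omits.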
