Gene Expression Clustering: A Novel Graph Partitioning Approach

Clustering is commonly applied to gene expression data to help understand how genes are affected by different disease conditions in a biological system. In this paper, we cast the clustering problem in graph-theoretic terms and apply a novel graph partitioning model, isoperimetric graph partitioning (IGP), to group biological samples from gene expression data. The IGP algorithm has several advantages over the well-established spectral graph partitioning (SGP) model. First, IGP requires only the solution of a sparse system of linear equations, rather than the eigenproblem solved in the SGP model. Second, IGP avoids degenerate cases produced by the spectral approach and thereby achieves a more accurate partition. Moreover, we integrate unsupervised gene selection into the proposed approach through two-way ordering of the gene expression data, so that irrelevant or redundant genes are eliminated and the clustering result improves. We evaluate our approach on several well-known problems involving gene expression profiles of colon cancer and leukemia subtypes. Our experimental results demonstrate that IGP consistently outperforms SGP and produces partitions that are closer to the original labeling of the sample sets provided by domain experts. Furthermore, the clustering accuracy improves significantly when IGP is combined with unsupervised gene (feature) selection.
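To make the "linear system instead of eigenproblem" point concrete, here is a minimal sketch of an isoperimetric bipartition. The abstract does not give the details, so the formulation is an assumption based on how isoperimetric partitioning is usually described: ground one node, solve the reduced Laplacian system L0 x0 = d0, then threshold the solution at the cut with the lowest isoperimetric ratio. The function names, the choice of the highest-degree node as ground, the dense pure-Python conjugate gradient solver, and the toy similarity matrix are all illustrative, not the authors' implementation or data.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (list of lists)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # current search direction
    rs = sum(v * v for v in r)       # squared residual norm
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


def isoperimetric_partition(W):
    """Bipartition a connected graph given a symmetric weight matrix W.

    Assumes W has a zero diagonal and non-negative weights.
    Returns (labels, ratio): a 0/1 label per node and the isoperimetric
    ratio cut(S) / min(vol(S), vol(complement)) of the chosen cut.
    """
    n = len(W)
    d = [sum(row) for row in W]                    # weighted degrees
    ground = max(range(n), key=lambda i: d[i])     # assumed ground choice
    keep = [i for i in range(n) if i != ground]

    # Reduced Laplacian L0 = (D - W) with the ground row and column removed.
    L0 = [[(d[i] if i == j else 0.0) - W[i][j] for j in keep] for i in keep]
    x0 = conjugate_gradient(L0, [d[i] for i in keep])

    x = [0.0] * n                                  # ground node stays at 0
    for i, v in zip(keep, x0):
        x[i] = v

    # Sweep thresholds over the sorted solution and keep the best ratio.
    order = sorted(range(n), key=lambda i: x[i])
    total = sum(d)
    best_ratio, best_set = float("inf"), None
    for k in range(1, n):
        S = set(order[:k])
        cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
        vol = sum(d[i] for i in S)
        ratio = cut / min(vol, total - vol)
        if ratio < best_ratio:
            best_ratio, best_set = ratio, S
    return [0 if i in best_set else 1 for i in range(n)], best_ratio


# Toy example (hypothetical): two tight triangles joined by one weak edge.
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = float(w)

labels, ratio = isoperimetric_partition(W)
print(labels, ratio)
```

In a gene expression setting, W would typically be a sample-by-sample similarity matrix derived from the expression profiles; how that matrix is built is not specified in the abstract. For graphs of realistic size, a sparse matrix representation and a library solver would replace the dense lists used here.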
