Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices

SUMMARY We introduce a novel unsupervised approach for the organization and visualization of multidimensional data. At the heart of the method is a presentation of the full pairwise distance matrix of the data points, viewed in pseudocolor. The ordering of points is iteratively permuted in search of a linear ordering, which can be used to study embedded shapes. Several examples indicate how the shapes of certain structures in the data (elongated, circular and compact) manifest themselves visually in our permuted distance matrix. It is important to identify the elongated objects since they are often associated with a set of hidden variables, underlying continuous variation in the data. The problem of determining an optimal linear ordering is shown to be NP-Complete, and therefore an iterative search algorithm with O(n3) step-complexity is suggested. By using sorting points into neighborhoods, i.e. SPIN to analyze colon cancer expression data we were able to address the serious problem of sample heterogeneity, which hinders identification of metastasis related genes in our data. Our methodology brings to light the continuous variation of heterogeneity--starting with homogeneous tumor samples and gradually increasing the amount of another tissue. Ordering the samples according to their degree of contamination by unrelated tissue allows the separation of genes associated with irrelevant contamination from those related to cancer progression. AVAILABILITY Software package will be available for academic users upon request.

[1]  B. Baxter,et al.  Conditionally positive functions andp-norm distance matrices , 1991, 1006.2449.

[2]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[3]  D. Botstein,et al.  Singular value decomposition for genome-wide expression data processing and modeling. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Wei Pan,et al.  A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments , 2002, Bioinform..

[5]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[6]  G. Getz,et al.  Expression profiles of acute lymphoblastic and myeloblastic leukemias with ALL-1 rearrangements , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[7]  D. Chang,et al.  EPLIN, Epithelial protein lost in neoplasm , 1999, Oncogene.

[8]  U. Brinkmann,et al.  CSE1L/CAS: Its role in proliferation and apoptosis , 2004, Apoptosis.

[9]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[10]  U. Alon,et al.  Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. , 2001, Cancer research.

[11]  Keith Wilson,et al.  Silence of chromosomal amplifications in colon cancer. , 2002, Cancer research.

[12]  Wolfram Liebermeister,et al.  Linear modes of gene expression determined by independent component analysis , 2002, Bioinform..

[13]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[14]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[15]  Christos H. Papadimitriou,et al.  The Euclidean Traveling Salesman Problem is NP-Complete , 1977, Theor. Comput. Sci..

[16]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[17]  T. Koopmans,et al.  Assignment Problems and the Location of Economic Activities , 1957 .

[18]  Tommi S. Jaakkola,et al.  Fast optimal leaf ordering for hierarchical clustering , 2001, ISMB.

[19]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Gustavo Stolovitzky,et al.  Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data , 2004, Bioinform..

[21]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[22]  Debashis Ghosh,et al.  Mixture models for assessing differential expression in complex tissues using microarray data , 2004, Bioinform..

[23]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem. , 2003 .

[24]  T. Ørntoft,et al.  Gene expression in colorectal cancer. , 2002, Cancer research.

[25]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.