Nearest Neighbor Networks: clustering expression data based on gene neighborhoods

Background
The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes).

Results
We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods.

Conclusion
The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
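
The idea described in the Results can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of Pearson correlation as the similarity measure, the parameters `k` and `clique_size`, the brute-force clique search, and the rule of merging any cliques that share a gene are all assumptions made for illustration, and the toy profiles are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumed similarity measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mutual_knn_graph(profiles, k):
    """Connect two genes only if each is among the other's k most similar genes."""
    genes = list(profiles)
    nearest = {}
    for g in genes:
        ranked = sorted(
            (h for h in genes if h != g),
            key=lambda h: pearson(profiles[g], profiles[h]),
            reverse=True,
        )
        nearest[g] = set(ranked[:k])
    return {g: {h for h in nearest[g] if g in nearest[h]} for g in genes}


def cliques_of_size(graph, size):
    """Brute-force search for fully connected gene sets (toy-sized inputs only)."""
    return [
        set(combo)
        for combo in combinations(sorted(graph), size)
        if all(b in graph[a] for a, b in combinations(combo, 2))
    ]


def merge_overlapping(cliques):
    """Merge cliques that share at least one gene into a single cluster."""
    clusters = []
    for clique in cliques:
        merged = set(clique)
        for existing in [c for c in clusters if c & merged]:
            merged |= existing
            clusters.remove(existing)
        clusters.append(merged)
    return clusters


# Invented toy data: two groups of co-varying genes plus one outlier ("g").
profiles = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [1, 2.1, 2.9, 4.2],
    "d": [4, 3, 2, 1],
    "e": [8, 6.2, 4, 2],
    "f": [4.1, 3, 1.9, 1],
    "g": [1, -1, 1, -1],
}

graph = mutual_knn_graph(profiles, k=2)
clusters = merge_overlapping(cliques_of_size(graph, size=3))
print(clusters)
```

In this toy input, gene `g` has nearest neighbors of its own, but none of them lists `g` in return, so it ends up with no edges and stays unclustered. That mirrors the property the abstract attributes to the mutual nearest neighbor requirement.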
