Visualizing high dimensional datasets using parallel coordinates: Application to gene prioritization

In this paper, we introduce a visualization tool for interactive and efficient exploration of high-dimensional data using parallel coordinates. We develop an algorithm that searches for an optimal permutation of the dimensions, so that the data miner can immediately see the most important features or irregularities in the dataset. The algorithm is implemented as a genetic algorithm that treats the ordering as a travelling salesman problem and uses maximal correlation as its fitness. Other features of the tool include selection operators for grouping the data (for example, selection by intersection or by angle), orthogonal and density plots that complement the parallel coordinates plot, manual rearrangement of the dimension order, the option to show all plots needed to see all relations between dimensions, and the display of a chosen number of standard deviations for each dimension separately. The tool is applied to several gene prioritization cases in search of genes relevant to particular genetic disorders. The datasets were obtained with the MerKator and Endeavour tools and comprise breast cancer, cataract, Charcot-Marie-Tooth and cardiomyopathy datasets, as well as a dataset relating 29 diseases to 22206 genes. The tool, manual and data can be downloaded from http://www.toomas.be/parcoord/.
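
The dimension-ordering idea can be illustrated with a minimal sketch. It is not the tool's implementation. It assumes that "maximal correlation as fitness" means maximising the sum of absolute Pearson correlations between neighbouring axes (an open travelling salesman path), and it uses order crossover, swap mutation, tournament selection and elitism as stand-in operators. All function names and parameter defaults are hypothetical, and at least two dimensions are assumed.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def abs_corr_matrix(columns):
    """Matrix of |correlation| between every pair of dimensions (one list per dimension)."""
    d = len(columns)
    return [[abs(pearson(columns[i], columns[j])) for j in range(d)] for i in range(d)]


def fitness(order, corr):
    """Assumed fitness: summed |correlation| between neighbouring axes in the plot."""
    return sum(corr[a][b] for a, b in zip(order, order[1:]))


def order_crossover(p1, p2, rng):
    """Order crossover (OX): keep a slice of p1 in place, fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    kept = p1[i:j]
    kept_set = set(kept)
    rest = [g for g in p2 if g not in kept_set]
    return rest[:i] + kept + rest[i:]


def swap_mutation(order, rng, rate):
    """With probability `rate`, swap two randomly chosen axes."""
    order = order[:]
    if rng.random() < rate:
        a, b = rng.sample(range(len(order)), 2)
        order[a], order[b] = order[b], order[a]
    return order


def ga_axis_order(columns, pop_size=40, generations=100, mutation_rate=0.3, seed=0):
    """Search for an axis order with high fitness; returns (order, fitness)."""
    rng = random.Random(seed)
    corr = abs_corr_matrix(columns)
    d = len(columns)

    def score(order):
        return fitness(order, corr)

    population = [rng.sample(range(d), d) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=score)]  # elitism: keep the current best
        while len(children) < pop_size:
            # Tournament selection of size 3 for each parent.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            child = order_crossover(p1, p2, rng)
            children.append(swap_mutation(child, rng, mutation_rate))
        population = children
    best = max(population, key=score)
    return best, score(best)


if __name__ == "__main__":
    # Toy data: four dimensions, six records each (illustrative values only).
    data = [
        [1, 2, 3, 4, 5, 6],
        [2, 1, 4, 3, 6, 5],
        [6, 5, 4, 3, 2, 1],
        [3, 1, 4, 1, 5, 9],
    ]
    order, value = ga_axis_order(data)
    print("axis order:", order, "fitness:", round(value, 3))
```

A genetic algorithm is a heuristic, so the returned order is not guaranteed to be optimal. For a handful of dimensions the result can be checked against a brute-force search over all permutations.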
