Grouping and visualizing human endogenous retroviruses by bootstrapping median self-organizing maps

About eight percent of the human genome consists of human endogenous retrovirus (HERV) sequences, the remains of ancient retroviral infections. HERVs are mutated and deficient, but they may still give rise to transcripts or affect the expression of human genes. HERVs stem from several kinds of retroviruses, and the possible current function of a HERV sequence may reflect its origin. Hence, classifying the diverse HERV sequences is a natural starting point for investigating the effects of HERVs in humans. The current HERV taxonomy is incomplete: some sequences cannot be assigned to any class, and the classification of others is ambiguous. A median self-organizing map (SOM), a SOM for data given as pairwise distances between samples, can be used to group all the HERVs found in the human genome. It visualizes the collection of 3661 HERV sequences found by the RetroTector system on a two-dimensional display that represents similarity relationships between individual sequences, as well as cluster structures and similarities between clusters. The SOM, like any dimensionality reduction method, necessarily has to make compromises when representing the data. In this work we extend the visualizations with bootstrap-based estimates of which parts of the visualization are reliable and which are not, and use the SOM to find potentially new HERV groups.
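To make the two ingredients concrete, the following is a minimal sketch, not the procedure used in this work. It assumes a batch-style median SOM in which every map unit's prototype is one of the data items, and it assumes that reliability can be summarized as the fraction of bootstrap replicates in which two items land on the same map unit. The function names (`assign`, `train_median_som`, `bootstrap_coassignment`), the grid size, the neighborhood width, and the toy one-dimensional data are all hypothetical; real input would be pairwise distances between HERV sequences.

```python
import math
import random


def assign(D, protos):
    """Index of the best-matching unit (closest prototype) for every item."""
    return [min(range(len(protos)), key=lambda u: D[i][protos[u]])
            for i in range(len(D))]


def train_median_som(D, rows=2, cols=2, n_iter=10, sigma=1.0, seed=0):
    """Batch median SOM on a symmetric distance matrix D (list of lists).

    Only pairwise distances are used: each unit's prototype is a data item.
    Returns the prototype item index for each map unit (row-major order).
    """
    rng = random.Random(seed)
    n = len(D)
    units = [(r, c) for r in range(rows) for c in range(cols)]
    # Gaussian neighborhood weight between every pair of map units.
    H = [[math.exp(-((ra - rb) ** 2 + (ca - cb) ** 2) / (2 * sigma ** 2))
          for (rb, cb) in units] for (ra, ca) in units]
    protos = rng.sample(range(n), len(units))  # needs n >= number of units
    for _ in range(n_iter):
        bmu = assign(D, protos)
        # New prototype of unit u: the item with the smallest
        # neighborhood-weighted sum of distances to all items.
        protos = [min(range(n),
                      key=lambda c: sum(H[u][bmu[j]] * D[j][c]
                                        for j in range(n)))
                  for u in range(len(units))]
    return protos


def bootstrap_coassignment(D, n_boot=20, seed=0, **som_args):
    """Fraction of bootstrap maps on which each pair of items shares a unit.

    Assumption: co-assignment frequency is used here as a simple stand-in
    for a reliability estimate.
    """
    rng = random.Random(seed)
    n = len(D)
    counts = [[0] * n for _ in range(n)]
    for b in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sub = [[D[i][j] for j in idx] for i in idx]
        sub_protos = train_median_som(sub, seed=seed + b, **som_args)
        protos = [idx[p] for p in sub_protos]       # back to original indices
        bmu = assign(D, protos)                     # project all original items
        for i in range(n):
            for j in range(n):
                counts[i][j] += bmu[i] == bmu[j]
    return [[c / n_boot for c in row] for row in counts]


# Toy stand-in for sequence distances: two groups of points on a line.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
D = [[abs(a - b) for b in points] for a in points]
protos = train_median_som(D)
reliability = bootstrap_coassignment(D, n_boot=20)
```

Pairs with co-assignment values near 1 are placed together consistently across replicates, while pairs with intermediate values sit in regions of the display that depend on the particular sample.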
