Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups

About 8 per cent of the human genome consists of human endogenous retroviral sequences (HERVs), which are remnants of ancient infections. HERVs may give rise to transcripts or affect the expression of human genes. The first step in understanding HERV function is to classify HERVs into families. In this work we study the relationships between existing HERV families and detect potentially new HERV families. A Median Self-Organizing Map (SOM), a SOM for non-vectorial data, is used to group and visualize a collection of 3661 HERVs. The SOM-based analysis is complemented with estimates of the reliability of the results. A novel trustworthiness visualization method is used to estimate which parts of the SOM visualization are reliable and which are not. The reliability of the extracted HERV groups of interest is verified by a bootstrap procedure suitable for SOM visualization-based analysis. The SOM detects a group of epsilonretroviral sequences and a group of ERV9, HERVW, and HUERSP3 sequences, which suggests that ERV9 and HERVW sequences may have a common origin.
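
The idea of a SOM for non-vectorial data can be illustrated with a minimal sketch. It assumes a one-dimensional map, a precomputed symmetric pairwise distance matrix, a hard neighborhood whose radius shrinks linearly to zero, and a deterministic initialization; none of these choices is taken from the work summarized above, and the function name `median_som` is hypothetical. Each map unit holds a prototype that is itself one of the data items (a generalized median), so only pairwise distances are needed.

```python
def median_som(dist, n_units, n_iter=10, radius=1):
    """Batch median SOM on a 1-D map (illustrative sketch, not the authors' code).

    dist    -- symmetric matrix of pairwise distances between data items
    n_units -- number of map units on the 1-D grid
    Returns (protos, bmus): the prototype item index of each unit and
    the best-matching unit of each item.
    """
    n = len(dist)
    # Assumption: deterministic initialization with evenly spaced item indices.
    protos = [u * n // n_units for u in range(n_units)]

    def assign():
        # Best-matching unit: the unit whose prototype item is closest.
        return [min(range(n_units), key=lambda u: dist[i][protos[u]])
                for i in range(n)]

    for it in range(n_iter):
        # Assumption: neighborhood radius shrinks linearly to zero.
        r = round(radius * (1 - it / max(1, n_iter - 1)))
        bmus = assign()
        for u in range(n_units):
            # Items mapped to this unit or to its grid neighbors.
            members = [i for i in range(n) if abs(bmus[i] - u) <= r]
            if members:
                # Generalized median: the member with the smallest
                # summed distance to all other members.
                protos[u] = min(members,
                                key=lambda c: sum(dist[c][m] for m in members))
    return protos, assign()


# Toy data: three well-separated groups; distances are absolute differences.
values = [0, 1, 2, 10, 11, 12, 20, 21, 22]
dist = [[abs(a - b) for b in values] for a in values]
protos, bmus = median_som(dist, n_units=3)
print(bmus)
```

For sequence data, the distance matrix would be derived from pairwise sequence comparisons instead of numeric differences; the map update itself stays unchanged because it never touches the items directly.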
