Mappability and read length

Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when mismatches between repeat units are allowed. Sequence reads from these regions are therefore unmappable when the read length is on the order of 10^3 bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable; these figures exclude the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies a diminishing return: each further increase in read length converts a smaller share of unmappable regions and reads into mappable ones, even though increasing read length always moves mappability toward 100%.
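The diminishing return can be illustrated with a minimal sketch. It assumes, purely for illustration, that the unmappable fraction follows a pure power law in read length, U(L) = U0 * (L / L0)^(-alpha), anchored at roughly 1% for L0 = 1000 bases as quoted above. The exponent `alpha = 0.5` and the function names are hypothetical and are not values or methods taken from the paper.

```python
def unmappable_fraction(read_length, alpha, ref_length=1000, ref_fraction=0.01):
    """Toy power-law model: U(L) = U0 * (L / L0) ** (-alpha).

    ref_fraction (~1% at 1000 bases) approximates the figure quoted in the
    text; alpha is a hypothetical exponent chosen only for illustration.
    """
    return ref_fraction * (read_length / ref_length) ** (-alpha)


def gain_per_doubling(read_length, alpha):
    """Reduction in unmappable fraction obtained by doubling the read length."""
    return (unmappable_fraction(read_length, alpha)
            - unmappable_fraction(2 * read_length, alpha))


if __name__ == "__main__":
    alpha = 0.5  # assumption, not an estimate from the human genome
    for length in (1000, 2000, 4000, 8000, 16000):
        frac = unmappable_fraction(length, alpha)
        gain = gain_per_doubling(length, alpha)
        print(f"L={length:>6}  unmappable={frac:.4%}  gain from doubling={gain:.4%}")
```

Under this toy model every doubling of the read length removes the same relative share of the remaining unmappable fraction, so the absolute gain shrinks at each step while the fraction itself never reaches zero for any finite read length.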
