Critical assessment of bioinformatics methods for the characterization of pathological repeat expansions with single-molecule sequencing data

Over the last 5 years, a number of studies have reported the successful application of single-molecule sequencing technologies to determining the size and sequence of pathological expanded microsatellite repeats. However, each study employed a different custom bioinformatics pipeline, which prevents meaningful comparisons and somewhat limits the reproducibility of the results. In this review, we provide a brief summary of state-of-the-art methods for the characterization of expanded repeat alleles, along with a detailed comparison of bioinformatics tools for the determination of repeat length and sequence, using both real and simulated data. Our reanalysis of publicly available human genome sequencing data suggests a modest, but statistically significant, increase in the error rate of single-molecule sequencing technologies at genomic regions containing short tandem repeats. Nevertheless, all the methods tested here, whether based on the alignment or the assembly of the reads, show high sensitivity in both detecting expanded tandem repeats and estimating the expansion size. This suggests that approaches based on single-molecule sequencing technologies are highly effective for the detection and quantification of tandem repeat expansions and contractions.
