Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data

Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. In contrast to unique genomic regions, micro-satellites exhibit high intrinsic polymorphism, which mainly derives from variability in length. Length distributions are widely used to represent this polymorphism. Recent studies report that the length distributions of some micro-satellites differ significantly between tumor tissue samples and normal samples, which has become a hot topic in cancer genomics. Several state-of-the-art approaches have been proposed to identify length distributions from sequencing data. However, the existing approaches can only handle micro-satellites shorter than one read length, which limits research on long micro-satellite events. In this article, we propose a probabilistic approach, implemented as ELMSI, that estimates the length distributions of micro-satellites longer than one read length. The core algorithm works on a set of mapped reads. It first clusters the reads and then adopts a k-mer extension algorithm to detect the repeat unit and the breakpoints. Next, it runs an expectation-maximization algorithm to approximate the true length distributions. According to the experiments, ELMSI is able to handle micro-satellites with lengths ranging from shorter than one read length to the 10 kbp scale. A series of comparison experiments was conducted, varying the number of micro-satellite regions, the read length, and the sequencing coverage; ELMSI outperforms MSIsensor in most cases.
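The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea behind an expectation-maximization step for a length distribution, not ELMSI's actual algorithm. It assumes that each observation (for example, a read cluster) already comes with a likelihood under every candidate micro-satellite length, and it estimates only the mixture weights over those lengths. The function name `em_length_distribution` and the input layout are hypothetical.

```python
def em_length_distribution(likelihoods, n_iter=200, tol=1e-9):
    """Estimate mixture weights over candidate lengths with EM.

    likelihoods[i][j] is an assumed, precomputed value of
    P(observation i | candidate length j). This is a generic
    mixture-proportion EM, not the model used by ELMSI.
    """
    n_len = len(likelihoods[0])
    # Start from a uniform distribution over the candidate lengths.
    weights = [1.0 / n_len] * n_len

    for _ in range(n_iter):
        expected = [0.0] * n_len
        # E-step: posterior responsibility of each length for each observation.
        for row in likelihoods:
            joint = [w * p for w, p in zip(weights, row)]
            total = sum(joint)
            if total == 0.0:
                continue  # observation is incompatible with every candidate
            for j, value in enumerate(joint):
                expected[j] += value / total

        norm = sum(expected)
        if norm == 0.0:
            return weights  # nothing informative to update from

        # M-step: new weights are the normalized expected counts.
        updated = [value / norm for value in expected]
        change = max(abs(a - b) for a, b in zip(updated, weights))
        weights = updated
        if change < tol:
            break
    return weights


if __name__ == "__main__":
    # Three observations support only the first length, one only the second.
    toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(em_length_distribution(toy))
```

In this toy setup the likelihoods are fixed and only the weights are learned; a real estimator would also have to derive those likelihoods from the mapped reads, which the abstract does not describe.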
