CIGenotyper: A Machine Learning Approach for Genotyping Complex Indel Calls

Complex insertions and deletions (complex indels) are a rare category of genomic structural variation. A complex indel presents as one or more DNA fragments inserted at the genomic location where a deletion occurs. Several studies have emphasized the importance of complex indels, and state-of-the-art approaches have been proposed to detect them from sequencing data. However, genotyping complex indel calls remains a separate and challenging computational problem, because some features commonly used for genotyping indel calls from sequencing data can be invalid owing to the composition of complex indels. In this article, we therefore propose a machine learning approach, CIGenotyper, to estimate the genotypes of complex indel calls. CIGenotyper adopts a relevance vector machine (RVM) framework. For each candidate call, it first extracts a set of features from the candidate region, which usually includes the read depth, the variant allelic frequency for aligned contigs, and the numbers of split reads and discordant paired-end reads, among others. Given the features of a complex indel call, the trained RVM outputs the genotype with the highest likelihood. We also propose an algorithm to train the RVM. We compare our approach with two popular approaches, Gindel and Pindel, on multiple groups of artificial datasets. Our approach outperforms both in average success rate in most cases as we vary the coverage of the data, the read lengths, and the length distributions of the preset complex indels.
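The final step described above, mapping a call's feature vector to the genotype with the highest likelihood, can be illustrated with a minimal Python sketch. This is not the CIGenotyper implementation: the genotype labels, the `CallFeatures` fields, and `toy_model` are hypothetical stand-ins chosen for illustration, and `toy_model` is a simple score on the variant allelic frequency, not a trained RVM.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed diploid genotype labels (not specified in the abstract).
GENOTYPES = ("0/0", "0/1", "1/1")


@dataclass
class CallFeatures:
    """Hypothetical container for the features named in the abstract."""
    read_depth: float
    variant_allele_freq: float  # from aligned contigs
    split_reads: int
    discordant_pairs: int

    def as_vector(self) -> List[float]:
        return [
            self.read_depth,
            self.variant_allele_freq,
            float(self.split_reads),
            float(self.discordant_pairs),
        ]


def toy_model(x: List[float]) -> Dict[str, float]:
    """Stand-in for a trained classifier (NOT an RVM).

    Scores each genotype by how close the variant allelic frequency is to
    the value expected for that genotype (0, 0.5, 1), then normalizes the
    scores so they sum to 1.
    """
    vaf = x[1]
    expected = (0.0, 0.5, 1.0)
    sigma = 0.15  # arbitrary spread, chosen for illustration only
    scores = {
        g: math.exp(-((vaf - c) ** 2) / (2 * sigma ** 2))
        for g, c in zip(GENOTYPES, expected)
    }
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}


def genotype_call(
    features: CallFeatures,
    model: Callable[[List[float]], Dict[str, float]],
) -> str:
    """Return the genotype to which the model assigns the highest likelihood."""
    likelihoods = model(features.as_vector())
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    call = CallFeatures(read_depth=30.0, variant_allele_freq=0.48,
                        split_reads=7, discordant_pairs=4)
    print(genotype_call(call, toy_model))
```

In a real pipeline, `toy_model` would be replaced by the trained classifier, and all four features (plus any others extracted from the candidate region) would contribute to the likelihoods rather than the allelic frequency alone.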
