Related Inference: A Supervised Learning Approach to Detect Signal Variation in Genome Data

The human genome, composed of nucleotides, is represented by a long sequence of the letters A, C, G, and T. Typically, organisms in the same species have similar genomes that differ by only a few sequences of varying lengths at varying positions. These differences appear as regions where letters are inserted, deleted, or inverted. Such anomalies are known as structural variants (SVs) and are difficult to detect. The standard approach for identifying SVs is to sequence fragments of DNA from the genome of interest and compare them to a reference genome. This process is usually complicated by errors in both the sequencing and mapping steps, which may increase the number of false positive detections. In this work we propose two different approaches for reducing the number of false positives. We focus on refining deletions detected by the popular SV tool delly. In particular, we investigate the benefit of simultaneously considering sequencing data from a parent and a child, using a neural network and gradient boosting as post-processing steps. We compare the performance of each method on simulated and real parent-child data and show that including related individuals in the training data greatly improves the ability to detect true SVs.
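As a rough illustration of what "simultaneously considering" a parent and a child could look like in a post-processing step, the sketch below pairs each candidate deletion in the child with the best-overlapping deletion called in the parent and builds a small feature vector that a classifier (for example a neural network or gradient boosting model) could consume. The `Deletion` record, the `support` field, and the choice of features are assumptions made for this example only; the abstract does not specify the features the authors use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Deletion:
    """Hypothetical record for one candidate deletion call."""
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive; assumed to be greater than start
    support: int  # assumed: number of reads supporting the call


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Fraction of overlap relative to the longer-relative interval.

    Returns the smaller of (shared / len(a)) and (shared / len(b)),
    or 0.0 if the calls do not overlap or lie on different chromosomes.
    """
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def paired_features(child: Deletion, parent_calls: List[Deletion]) -> List[float]:
    """Build an illustrative feature vector for one child call.

    Features (all assumptions for this sketch):
    child length, child read support, best reciprocal overlap with any
    parent call, and the read support of that best-matching parent call.
    """
    best = max(parent_calls,
               key=lambda p: reciprocal_overlap(child, p),
               default=None)
    overlap = reciprocal_overlap(child, best) if best is not None else 0.0
    parent_support = best.support if best is not None and overlap > 0 else 0
    return [
        float(child.end - child.start),
        float(child.support),
        overlap,
        float(parent_support),
    ]


child_call = Deletion("chr1", 100, 200, support=8)
parent_calls = [Deletion("chr1", 150, 250, support=5),
                Deletion("chr2", 100, 200, support=9)]
features = paired_features(child_call, parent_calls)
```

A child call with strong overlap and support in the parent is more plausibly inherited, and hence real, than an isolated call; feeding such paired features to a supervised model is one way the relatedness signal could be exploited, though the actual pipeline may differ.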
