Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

Transcriptomic structural variants (TSVs), structural variants that affect expressed regions, are common, especially in cancer. Detecting TSVs is a challenging computational problem, and sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome rearrangements that maximize the number of reads concordant with at least one rearrangement. This directly models a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k = 1 and a 3/4-approximation algorithm for the diploid case (k = 2), assuming an oracle for k = 1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We completely characterize the graph structures that require k > 1 to satisfy all edges and show that such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.
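To make the MCAP objective concrete, the following is a minimal brute-force sketch for toy instances. It is not the approximation algorithm or the integer linear program described above, and its runtime is exponential. The edge encoding, the function names, and the concordance rule (an edge is concordant when it joins the right-hand end of the earlier segment to the left-hand end of the later segment) are assumptions made for illustration.

```python
from itertools import combinations, permutations, product

# Hypothetical edge encoding: (seg_u, end_u, seg_v, end_v, weight), where each
# end is "h" (head) or "t" (tail), seg_u != seg_v, and weight is the number of
# reads supporting the edge.

def is_concordant(edge, position, orientation):
    """Assumed rule: the edge must leave the right-hand end of the earlier
    segment and enter the left-hand end of the later segment."""
    u, end_u, v, end_v, _ = edge
    if position[u] > position[v]:
        u, end_u, v, end_v = v, end_v, u, end_u
    # A forward (+1) segment reads head -> tail; a reversed (-1) one tail -> head.
    right_end_of_first = "t" if orientation[u] == 1 else "h"
    left_end_of_second = "h" if orientation[v] == 1 else "t"
    return end_u == right_end_of_first and end_v == left_end_of_second

def all_arrangements(n):
    """Enumerate every ordering and orientation of n segments (n! * 2^n)."""
    for order in permutations(range(n)):
        for signs in product((1, -1), repeat=n):
            position = {seg: i for i, seg in enumerate(order)}
            orientation = dict(zip(order, signs))
            yield tuple(zip(order, signs)), position, orientation

def brute_force_mcap(n, edges, k):
    """Return (best_weight, arrangements) over all choices of k arrangements,
    counting each edge once if it is concordant with at least one of them."""
    satisfied = []
    for arrangement, position, orientation in all_arrangements(n):
        sat = frozenset(i for i, e in enumerate(edges)
                        if is_concordant(e, position, orientation))
        satisfied.append((arrangement, sat))
    best_weight, best_choice = -1, None
    for choice in combinations(satisfied, k):
        union = frozenset().union(*(sat for _, sat in choice))
        weight = sum(edges[i][4] for i in union)
        if weight > best_weight:
            best_weight, best_choice = weight, [arr for arr, _ in choice]
    return best_weight, best_choice

# Two edges that no single arrangement can satisfy at once under this rule.
example_edges = [(0, "t", 1, "h", 5), (1, "t", 0, "h", 3)]
weight_k1, _ = brute_force_mcap(2, example_edges, k=1)  # 5
weight_k2, _ = brute_force_mcap(2, example_edges, k=2)  # 8
```

In this toy instance the two edges demand opposite relative orders of segments 0 and 1, so a single arrangement recovers only the heavier edge, while two arrangements recover both. This is the kind of conflict that motivates allowing k > 1.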
