LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing

Background: Long-read RNA-seq techniques can generate reads that encompass a large proportion of, or the entire, mRNA/cDNA molecule, so they are expected to address inherent limitations of short-read RNA-seq techniques, which typically generate reads shorter than 150 bp. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors.

Results: In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF outperformed existing computational tools in detecting known gene fusions. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset from an acute myeloid leukemia sample and pinpointed, at base resolution, the exact location of a translocation previously known only at cytogenetic resolution; this breakpoint was further validated by Sanger sequencing.

Conclusions: In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.
