Clustering de Novo by Gene of Long Reads from Transcriptomics Data

Long-read sequencing currently provides sequences of several thousand base pairs. This makes it possible to obtain complete transcripts, which offers an unprecedented view of the cellular transcriptome. However, the literature lacks tools to cluster such data de novo, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole-transcriptome sequencing data accurately and without a reference genome, in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is twofold: a new algorithm adapted to clustering reads by gene, and a practical, freely available tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device, and we use this dataset to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is better suited to transcriptomics long reads. When a reference is available and mapping is therefore possible, we show that it stands as an alternative method that predicts complementary clusters.
