MinIsoClust: Isoform clustering using minhash and locality sensitive hashing

With the advent of next-generation sequencing technologies, computational transcriptome assembly of RNA-Seq data has become a critical step in many biological and biomedical studies. The accuracy of transcriptome assembly methods is hindered by the presence of alternatively spliced transcripts (isoforms). Identifying and quantifying isoforms is also essential for understanding complex biological functions, many of which are associated with diseases. However, when quality reference genomes are not available, clustering isoform sequences using sequence identity alone is often difficult because exon composition is heterogeneous among isoforms. Clustering a large number of transcript sequences also requires a scalable technique. In this study, we propose a minwise-hashing-based method, MinIsoClust, for fast and accurate clustering of transcript sequences that can be used to identify groups of isoforms. We tested this new method using simulated datasets. The results demonstrated that MinIsoClust was more accurate than CD-HIT-EST, isONclust, and MMseqs2/Linclust. MinIsoClust also performed better than isONclust and MMseqs2/Linclust in terms of computational time and space efficiency. The source code of MinIsoClust is freely available at https://github.com/srbehera/MinIsoClust.
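As a rough illustration of the general idea behind minwise hashing combined with locality-sensitive hashing, the sketch below builds a MinHash signature from each sequence's k-mer set, splits the signature into bands, and groups sequences that collide in at least one band. It is not the MinIsoClust implementation: the function names, the values of `k`, `num_hashes`, and `bands`, the use of salted BLAKE2b as the hash family, and the union-find merging of bucket members are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k):
    """Return the set of k-mers of a sequence (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one member of the hash family."""
    h = hashlib.blake2b(kmer.encode(), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash_signature(seq, k=8, num_hashes=32):
    """MinHash signature: the minimum hash value of the k-mer set per seed."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_clusters(seqs, k=8, num_hashes=32, bands=16):
    """Group sequences whose signatures agree on at least one whole band."""
    rows = num_hashes // bands
    sigs = {name: minhash_signature(s, k, num_hashes)
            for name, s in seqs.items()}

    # Union-find over sequence names (illustrative merging strategy).
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Sequences sharing a (band index, band values) key land in one bucket.
    buckets = defaultdict(list)
    for name, sig in sigs.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(name)

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(set)
    for name in seqs:
        clusters[find(name)].add(name)
    return sorted(sorted(c) for c in clusters.values())
```

In this sketch, more rows per band make a band collision stricter, while more bands give similar sequences more chances to collide, so `bands` and `num_hashes` together set the effective similarity threshold. Two transcripts that share most of their exons share many k-mers and are therefore likely, though not guaranteed, to fall into the same cluster.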
