Towards the Latent Transcriptome

In this work, we propose a method to compute continuous embeddings for k-mers from raw RNA-seq data in a reference-free fashion. We report that our model captures information about both DNA sequence similarity and DNA sequence abundance in the embedding latent space. We confirm the quality of these vectors by comparing them to known gene sub-structures, and we report that the latent space recovers exon information from raw RNA-seq data from acute myeloid leukemia patients. Furthermore, we show that this latent space allows the detection of genomic abnormalities such as translocations, as well as patient-specific mutations, making this representation space useful for both visualization and analysis.
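For readers unfamiliar with the term, a k-mer is a substring of length k taken from a sequencing read. The sketch below is only an illustration of how k-mers and their abundances could be collected from raw reads without a reference genome. It is not the embedding model itself, and the function names, the toy reads, and the choice of k = 4 are assumptions made for this example rather than details from the paper.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads (its abundance)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read.upper(), k))
    return counts


# Toy reads (hypothetical); real input would be reads from a FASTQ file.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Counts of this kind give the abundance signal per k-mer, while the k-mer strings themselves carry the sequence signal; how the model combines the two into an embedding is not specified in the abstract.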
