Multidimensional Scaling for Gene Sequence Data with Autoencoders

Multidimensional scaling (MDS) has long played a vital role in analysing gene sequence data to identify clusters and patterns. However, the computational complexity and memory requirements of state-of-the-art MDS algorithms make them infeasible to apply to large datasets. In this paper we present an autoencoder-based dimensionality reduction model that scales easily to datasets containing millions of gene sequences, while attaining results comparable to state-of-the-art MDS algorithms with minimal resource requirements. The model also supports out-of-sample data points, with an accuracy of 99.5% or higher in our experiments. The proposed model is evaluated against DAMDS on a real-world fungi gene sequence dataset. The results demonstrate the effectiveness of the autoencoder-based dimensionality reduction model and its advantages.
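As an illustration of what "comparable to MDS" can mean in practice, the sketch below computes a normalized stress value: the squared mismatch between the original pairwise distances and the distances in the low-dimensional embedding, divided by the squared original distances. This is only a minimal sketch under stated assumptions. The abstract does not say which quality measure was used, and the function names `pairwise_euclidean` and `normalized_stress`, as well as the toy data, are hypothetical and chosen for illustration.

```python
import math


def pairwise_euclidean(points):
    """Return the full matrix of Euclidean distances between points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def normalized_stress(target, points):
    """Normalized stress of an embedding (assumed metric, not from the paper).

    target: square matrix of original pairwise distances
    points: embedded coordinates, one tuple per item
    Returns 0.0 when every pairwise distance is preserved exactly.
    """
    embedded = pairwise_euclidean(points)
    n = len(points)
    num = 0.0
    den = 0.0
    # Each unordered pair is counted once.
    for i in range(n):
        for j in range(i + 1, n):
            num += (target[i][j] - embedded[i][j]) ** 2
            den += target[i][j] ** 2
    return num / den if den else 0.0


# Toy data: three collinear points in 3D, so a 1D embedding can be exact.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
target = pairwise_euclidean(original)

good = normalized_stress(target, [(0.0,), (5.0,), (10.0,)])
bad = normalized_stress(target, [(0.0,), (0.0,), (0.0,)])
print(good, bad)
```

A lower value means the embedding preserves the input distances more faithfully. For gene sequences, the `target` matrix would hold sequence dissimilarities rather than Euclidean distances; that substitution is likewise an assumption here.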
