Marginalised stacked denoising autoencoders for metagenomic data binning

Shotgun sequencing has facilitated the analysis of complex microbial communities. We recently showed how local binary patterns (LBP), a technique from image processing, can be used to analyse the sequenced samples. LBP codes represent the data in a sparse, high-dimensional space. To improve the performance of our pipeline, marginalised stacked autoencoders are used here to learn frequent LBP codes and to map the high-dimensional space to a lower-dimensional, dense space. We demonstrate the performance of this approach on both low- and high-complexity simulated metagenomic data, and compare it with several existing techniques, including principal component analysis (PCA) in the dimension reduction step and k-mer frequency in the feature extraction step.
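The mapping from a sparse, high-dimensional space to a dense, lower-dimensional one can be illustrated with a small sketch. It is not the authors' implementation: it follows the general closed-form marginalised denoising autoencoder formulation, in which the reconstruction weights are obtained by solving a linear system rather than by iterative training. The function names `mda_layer` and `msda_reduce`, the corruption probability `p`, the regulariser `reg`, the number of layers, and the choice to reconstruct only the `n_keep` most frequently non-zero features in the first layer are all assumptions made for illustration.

```python
import numpy as np

def mda_layer(X, p, target_idx=None, reg=1e-5):
    """One marginalised denoising autoencoder layer (closed form).

    X: (d, n) array with one column per sample.
    p: assumed probability that an input feature is zeroed by the corruption.
    target_idx: indices of the features to reconstruct (None means all d).
    Returns (W, H) where H = tanh(W @ [X; 1]).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])        # append a constant bias row
    q = np.full(d + 1, 1.0 - p)                 # survival probability per feature
    q[-1] = 1.0                                 # the bias row is never corrupted
    S = Xb @ Xb.T                               # scatter matrix of the clean data
    Q = S * np.outer(q, q)                      # expected corrupted scatter, off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))         # expected corrupted scatter, diagonal
    rows = np.arange(d) if target_idx is None else np.asarray(target_idx)
    P = S[rows, :] * q[np.newaxis, :]           # expected clean/corrupted cross term
    # W = P Q^{-1}; Q is symmetric, so solve (Q + reg*I) W^T = P^T instead.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    H = np.tanh(W @ Xb)
    return W, H

def msda_reduce(X, n_keep, p=0.5, n_layers=3):
    """Stack layers; the first one reconstructs only the n_keep most frequent features."""
    freq = (X != 0).sum(axis=1)                 # how often each feature is non-zero
    idx = np.argsort(freq)[::-1][:n_keep]       # indices of the most frequent features
    _, H = mda_layer(X, p, target_idx=idx)      # (n_keep, n) dense representation
    for _ in range(n_layers - 1):
        _, H = mda_layer(H, p)                  # further layers keep the reduced size
    return H
```

Called on a hypothetical sparse feature matrix with one column per sequence fragment, `msda_reduce(X, n_keep=8)` returns a dense array with 8 rows and the same number of columns, which could then be passed to a clustering or visualisation step.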
