Anomaly detection in genomic catalogues using unsupervised multi-view autoencoders

Background: Accurate identification of transcriptional regulator binding locations is essential for the analysis of genomic regions, including cis-regulatory elements (CREs). The customary NGS approaches, predominantly ChIP-seq, can be obscured by data anomalies and biases that are difficult to detect without supervision.

Results: Here, we develop a method that leverages the usual combinations between many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions' representations with multi-view convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CREs according to group completeness. We then apply it to the large volume of curated ChIP-seq data in the ReMap database. We show that peaks lacking known biological correlators are singled out and are less well confirmed in real data. We also propose normalization approaches useful in interpreting black-box models.

Conclusion: Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak.
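The idea of scoring peaks through lossy compression can be illustrated with a toy sketch. It is not the atyPeak implementation and does not use its API: a truncated SVD stands in for the multi-view convolutional autoencoder as a simple linear compressor, and the data, the two correlation groups, and every name (`low_rank_rebuild`, `peaks`, `groups`) are hypothetical. Each region carries peaks from one group of co-occurring series; a few lone peaks without their usual partners are then injected, and the rebuilt value at each peak serves as its score.

```python
import numpy as np


def low_rank_rebuild(matrix, rank):
    """Lossy compression: keep only the top `rank` singular components."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


rng = np.random.default_rng(0)
n_regions, n_series = 200, 6
# Assumption: two groups of experimental series that usually co-occur.
groups = [[0, 1, 2], [3, 4, 5]]

# Rows are genomic regions, columns are series; 1.0 marks a peak.
peaks = np.zeros((n_regions, n_series))
for region in range(n_regions):
    group = groups[rng.integers(len(groups))]
    peaks[region, group] = 1.0

# Inject lone peaks: one series from the absent group, without its partners.
lone_mask = np.zeros(peaks.shape, dtype=bool)
for region in range(10):
    absent = np.flatnonzero(peaks[region] == 0)
    peaks[region, absent[0]] = 1.0
    lone_mask[region, absent[0]] = True

# Compress to two components (one per group) and rebuild.
rebuilt = low_rank_rebuild(peaks, rank=2)

# Score each peak by its rebuilt value: poorly corroborated peaks
# are not preserved well by the compression.
typical_mask = (peaks == 1.0) & ~lone_mask
lone_score = rebuilt[lone_mask].mean()
typical_score = rebuilt[typical_mask].mean()

print(f"mean score, lone peaks:    {lone_score:.2f}")
print(f"mean score, typical peaks: {typical_score:.2f}")
```

In this toy setting the lone peaks receive a lower mean score than peaks that appear with their full group, which mirrors the notion of evaluating regions according to group completeness. A nonlinear autoencoder would replace `low_rank_rebuild` in a realistic setting.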
