A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder

Recent advances in the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) have provided cell-specific chromatin accessibility landscapes of cis-regulatory elements, offering deeper insights into cellular states and dynamics. However, few research efforts have been dedicated to modeling the relationship between regulatory grammars and single-cell chromatin accessibility, or to incorporating the different analysis scenarios of scATAC-seq data into a general framework. To this end, we propose a unified deep learning framework based on the ProdDep Transformer Encoder, dubbed PROTRAIT, for scATAC-seq data analysis. Motivated by deep language models, PROTRAIT leverages the ProdDep Transformer Encoder to capture the syntax of transcription factor (TF)-DNA binding motifs from scATAC-seq peaks, both to predict single-cell chromatin accessibility and to learn single-cell embeddings. Based on the cell embeddings, PROTRAIT annotates cell types using the Louvain algorithm. Furthermore, PROTRAIT identifies likely noise in the raw scATAC-seq data and denoises those values based on the predicted chromatin accessibility. In addition, PROTRAIT employs differential accessibility analysis to infer TF activity at single-cell and single-nucleotide resolution. Extensive experiments on the Buenrostro2018 dataset validate the effectiveness of PROTRAIT for chromatin accessibility prediction, cell type annotation, and scATAC-seq data denoising, where it outperforms current approaches across different evaluation metrics. We also confirm that the inferred TF activity is consistent with the literature. Finally, we demonstrate the scalability of PROTRAIT to datasets containing over one million cells.
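The denoising step described above can be illustrated with a minimal sketch. It is not PROTRAIT's implementation: the abstract does not state how likely noise is identified, so the disagreement rule used here (an observed value that strongly contradicts the predicted accessibility) and the names `denoise`, `low`, and `high` are assumptions made purely for illustration.

```python
def denoise(raw, predicted, low=0.1, high=0.9):
    """Replace entries of a binary peak-by-cell matrix that strongly
    disagree with the predicted accessibility.

    Assumption (not from the paper): an entry counts as likely noise when
    it is 0 but the prediction is >= high, or 1 but the prediction is <= low.
    """
    denoised = []
    for raw_row, pred_row in zip(raw, predicted):
        new_row = []
        for observed, prob in zip(raw_row, pred_row):
            likely_dropout = observed == 0 and prob >= high
            likely_false_positive = observed == 1 and prob <= low
            if likely_dropout or likely_false_positive:
                # Likely noise: fall back to the predicted accessibility.
                new_row.append(prob)
            else:
                # Observation and prediction are compatible: keep the raw value.
                new_row.append(float(observed))
        denoised.append(new_row)
    return denoised


# Toy data: 2 peaks x 3 cells (hypothetical values).
raw = [[1, 0, 0], [0, 1, 1]]
predicted = [[0.95, 0.92, 0.05], [0.03, 0.08, 0.88]]
denoised = denoise(raw, predicted)
```

In this toy input, only the two entries that contradict a confident prediction are replaced; all other entries keep their observed values.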
