Analysis of B-cell receptor repertoires in COVID-19 patients using deep embedded representations of protein sequences

Analyzing B cell receptor (BCR) repertoires is useful for evaluating an individual's immunological status. Conventional repertoire analysis methods have focused on comprehensive assessments of clonal composition, including V(D)J segment usage, nucleotide insertions and deletions, and amino acid distributions. Here, we introduce a computational approach that applies deep-learning-based protein embedding techniques to BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and summing their vector representations, we represent an entire repertoire as a 100-dimensional vector, that is, as a single data point in vector space. We demonstrate that this approach not only accurately clusters the BCR repertoires of coronavirus disease 2019 (COVID-19) patients and healthy subjects but also tracks small changes in immune status over time as patients undergo treatment. Furthermore, using these distributed representations, we trained an XGBoost classification model that achieved a mean accuracy of over 87% given a repertoire of CDR3 sequences.
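
The repertoire-to-vector step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: sequences are split into overlapping 3-mers in the style of ProtVec, the selected sequences are summed without frequency weighting, and a deterministic pseudo-random lookup (the hypothetical `kmer_vector`) stands in for a trained embedding model. The names `embed_sequence`, `embed_repertoire`, and `top_n` are likewise illustrative.

```python
import hashlib
import random
from collections import Counter

DIM = 100  # embedding dimensionality stated in the abstract


def kmer_vector(kmer, dim=DIM):
    """Hypothetical stand-in for a trained k-mer embedding lookup.

    Returns a deterministic pseudo-random vector per k-mer so the sketch
    runs without a pretrained model; a real pipeline would look the k-mer
    up in learned embeddings instead.
    """
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_sequence(seq, k=3, dim=DIM):
    """Sum the vectors of all overlapping k-mers in one amino acid sequence."""
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        for j, x in enumerate(kmer_vector(seq[i:i + k], dim)):
            vec[j] += x
    return vec


def embed_repertoire(cdr3_sequences, top_n=10, dim=DIM):
    """Represent a repertoire as the sum of its top_n most frequent sequences.

    Assumption: each selected sequence contributes once, regardless of how
    often it occurs in the repertoire.
    """
    counts = Counter(cdr3_sequences)
    vec = [0.0] * dim
    for seq, _count in counts.most_common(top_n):
        for j, x in enumerate(embed_sequence(seq, dim=dim)):
            vec[j] += x
    return vec


# Toy repertoire of made-up CDR3-like strings (not real data).
repertoire = ["CARDYW"] * 3 + ["CARGGW"] * 2 + ["CAKDYW"]
repertoire_vector = embed_repertoire(repertoire, top_n=2)
```

Each repertoire then becomes one fixed-length vector, which is the form a clustering method or a classifier such as XGBoost takes as input.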
