Bags of Words Models of Epitope Sets: HIV Viral Load Regression with Counting Grids

The immune system gathers evidence of the execution of various molecular processes, both foreign and the cells' own, as time- and space-varying sets of epitopes: small linear or conformational segments of the proteins involved in these processes. Epitopes have no obvious ordering in this scheme: the immune system simply sees these epitope sets as unordered "bags" of simple signatures, and it must decide how to act based on their contents. The immense landscape of possible bags of epitopes is shaped by the cellular pathways in various cells, as well as by the characteristics of the internal sampling process that chooses epitopes and brings them to the cell surface. As a consequence, when infected by the same pathogen, different individuals' cells present very different epitope sets. Modeling this landscape should thus be a key step in computational immunology. We show that, among possible bag-of-words models, the counting grid is the best suited to modeling cellular presentation. We describe each patient by a bag of peptides that they are likely to present on the cell surface. In regression tests, we found that, compared to the state of the art, counting grids explain more than twice as much of the log viral load variance in these patients. This is potentially a significant advance in the field, given that a large part of the log viral load variance also depends on the infecting HIV strain, and that HIV polymorphisms themselves are known to associate strongly with HLA types; both effects are beyond what is modeled here.
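
As a rough illustration of the bag-of-peptides representation described above, the following minimal Python sketch counts the fixed-length peptides of a protein sequence while ignoring their order. The function name `peptide_bag`, the window length `k`, and the `is_presented` filter are assumptions made for this example; in particular, `is_presented` is a hypothetical stand-in for whatever presentation or binding predictor is actually used, and it accepts every peptide by default.

```python
from collections import Counter


def peptide_bag(protein, k=9, is_presented=lambda peptide: True):
    """Return an unordered bag (multiset) of the k-mers found in `protein`.

    `is_presented` is a hypothetical filter standing in for a real
    presentation predictor; by default every peptide is kept.
    """
    bag = Counter()
    # Slide a window of length k over the sequence; positions are discarded,
    # so only the counts of each peptide survive.
    for start in range(len(protein) - k + 1):
        peptide = protein[start:start + k]
        if is_presented(peptide):
            bag[peptide] += 1
    return bag


# Toy usage on a made-up sequence with 3-mers.
toy_bag = peptide_bag("ABCABC", k=3)
```

A patient-level bag could then be formed by summing such counters over the proteins of interest, giving the kind of count vector that bag-of-words models such as counting grids take as input.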
