Non‐parametric methods to predict HIV drug susceptibility phenotype from genotype

Medical management of HIV infection requires an understanding of the relationship between viral genetic sequences and viral susceptibility to antiretroviral drugs. Because viral genotype data are high-dimensional, traditional statistical methods are not well suited to investigating this relationship. We develop non-parametric methods specifically for the setting where high-dimensional data provide a basis for predicting a low-dimensional response variable. Our non-recursive methods proceed in three stages: (i) build models, in a forward-stepwise manner, that predict phenotype response from genotype sequence; (ii) identify specific patterns of amino acid sequence that are most influential in predicting phenotype; and (iii) identify combinations of codons whose mutations have either a concordant or a discordant association. The methods are applied to a data set provided by the Virco Group that contains protease genome sequences and IC50 measurements for amprenavir, a drug from the protease inhibitor class, on 2747 patient samples. Using these methods, we identified eight codons in the protease region of the HIV genome that predict resistance to amprenavir, and determined pairs of codons whose mutations tend either to occur together or to preclude each other.
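As an illustration of what stage (iii) asks for, the sketch below scores each pair of codons by a smoothed log odds ratio of co-mutation: positive values suggest a concordant association, negative values a discordant one. This is an assumed stand-in, not the procedure developed in the paper. The function name `pairwise_log_odds`, the 0.5 continuity correction, the codon labels `c1` to `c3`, and the eight toy samples are all hypothetical and bear no relation to the Virco data.

```python
from itertools import combinations
from math import log


def pairwise_log_odds(samples, codons):
    """Return a smoothed log odds ratio of co-mutation for every codon pair.

    samples: list of sets, each holding the codons mutated in one sample.
    codons:  list of codon labels to compare pairwise.
    """
    n = len(samples)
    scores = {}
    for a, b in combinations(codons, 2):
        both = sum(1 for s in samples if a in s and b in s)
        only_a = sum(1 for s in samples if a in s and b not in s)
        only_b = sum(1 for s in samples if b in s and a not in s)
        neither = n - both - only_a - only_b
        # Adding 0.5 to each cell (an assumed smoothing choice) keeps the
        # ratio finite when a cell of the 2x2 table is empty.
        odds_ratio = ((both + 0.5) * (neither + 0.5)) / (
            (only_a + 0.5) * (only_b + 0.5)
        )
        scores[(a, b)] = log(odds_ratio)
    return scores


# Synthetic toy data: c1 and c2 always mutate together, c3 never joins them.
toy_samples = [
    {"c1", "c2"}, {"c1", "c2"}, {"c1", "c2"},
    {"c3"}, {"c3"}, {"c3"},
    set(), set(),
]
scores = pairwise_log_odds(toy_samples, ["c1", "c2", "c3"])
for pair, value in scores.items():
    label = "concordant" if value > 0 else "discordant"
    print(pair, round(value, 2), label)
```

On real data, a score like this would need a significance assessment (for example a permutation test) and a correction for the number of pairs examined before any pair is called concordant or discordant.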