Sequence-based prediction of protein-protein interactions using ensemble based classifier combined with global encoding in HIV (human immunodeficiency virus)

Human Immunodeficiency Virus (HIV) is an obligate intracellular retrovirus that attacks the human immune system. The virus attacks through interactions between viral and human proteins. This research uses amino acid sequence data, extracts features with the Global Encoding method, and combines them with a Rotation Forest classifier to predict interactions between HIV and human proteins. Global Encoding first groups the 20 amino acid types into 6 classes and then forms 10 combinations, each containing three different classes. Based on these 10 combinations, a protein sequence is transformed into 10 binary characteristic sequences. Each characteristic sequence is further divided into several subsequences according to a partition method. Two types of protein descriptor, composition and transition, are then extracted to represent each protein sequence and are used as the final input vectors for the classifier. Finally, Rotation Forest predicts the class of the interaction between human and HIV proteins. The best model obtained in this research achieves an accuracy of 79.50%, a sensitivity of 79.91%, a specificity of 79.07%, and a precision of 79.77% in predicting protein interactions between HIV and human.
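The Global Encoding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the six amino acid classes, the partition method, or the exact descriptor formulas, so the following are all assumptions made for illustration: the grouping in `CLASSES`, reading the 10 combinations as the 10 ways of splitting the six classes into two groups of three, the prefix-based partition in `prefixes`, the simplified `composition` and `transition` definitions, and the default `parts=5`.

```python
from itertools import combinations

# Assumed grouping of the 20 amino acids into 6 classes (not stated in the abstract).
CLASSES = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLASS_OF = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# Assumed reading of "10 combinations": the 10 ways to split 6 classes into
# two groups of three. Fixing class 0 in the first group avoids counting
# each split twice.
TRIPLES = [c for c in combinations(range(6), 3) if 0 in c]


def characteristic_sequences(seq):
    """Return 10 binary sequences: 1 if the residue's class is in the triple, else 0."""
    classes = [CLASS_OF[aa] for aa in seq]  # raises KeyError on non-standard residues
    return [[1 if c in triple else 0 for c in classes] for triple in TRIPLES]


def prefixes(bits, parts):
    """Assumed partition: the k-th subsequence is the first k/parts of the sequence."""
    n = len(bits)
    return [bits[: (k * n) // parts] for k in range(1, parts + 1)]


def composition(bits):
    """Simplified composition descriptor: fraction of 1s (0.0 if empty)."""
    return sum(bits) / len(bits) if bits else 0.0


def transition(bits):
    """Simplified transition descriptor: fraction of adjacent positions that differ."""
    if len(bits) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(bits, bits[1:]))
    return changes / (len(bits) - 1)


def global_encoding(seq, parts=5):
    """Concatenate composition and transition over all subsequences of all 10 encodings."""
    features = []
    for bits in characteristic_sequences(seq):
        for sub in prefixes(bits, parts):
            features.append(composition(sub))
            features.append(transition(sub))
    return features


vector = global_encoding("MKVLAAGIVGLCAKRDEFWSTGP")
print(len(vector))  # 10 sequences x 5 subsequences x 2 descriptors = 100
```

In a pipeline like the one described, the vectors of an HIV protein and a human protein would presumably be combined (for example, concatenated) into a single input for the classifier. How the pairs are combined is not stated in the abstract.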