Multisource transfer learning for host-pathogen protein interaction prediction in unlabeled tasks

We consider the problem of building a predictive model for host-pathogen protein interactions when no known interactions are available. Our goal is to predict the protein-protein interactions (PPIs) between the plant host Arabidopsis thaliana and the bacterial species Salmonella typhimurium. Our method, based on transfer learning, uses labeled data, i.e., known interactions, from other species (we call these the source tasks). The first challenge is to pick the best instances from the source tasks, such that the resulting model, when applied to the target task, generates high-confidence predictions. To this end, we use the instance reweighting technique Kernel Mean Matching (KMM). The reweighted instances are used to build a kernelized support vector machine (SVM) model, which is applied to the target data. This brings forth the second challenge: selecting appropriate hyperparameters while building a model for a task with no labeled data. For the purposes of evaluation, we apply our method to a task for which some labeled data are available. We find that the choice of the right source examples makes a significant difference in performance on the target task.
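
To make the reweighting step concrete, the following is a minimal sketch of Kernel Mean Matching, not the implementation used in this work. It assumes a Gaussian kernel and solves the usual KMM quadratic objective by projected gradient descent under box constraints only; the constraint that the weights sum to approximately the source sample size, which the standard formulation includes, is omitted for brevity. The function names and the parameters `gamma`, `B`, and `n_iter` are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def kmm_weights(X_src, X_tgt, gamma=1.0, B=10.0, n_iter=2000):
    """Simplified KMM: one non-negative weight per source instance.

    Minimizes 0.5 * beta^T K beta - kappa^T beta subject to 0 <= beta <= B,
    which pulls the weighted source mean towards the target mean in the
    kernel feature space. (Assumption: the sum constraint is dropped.)
    """
    n_src, n_tgt = len(X_src), len(X_tgt)
    K = rbf_kernel(X_src, X_src, gamma)
    kappa = (n_src / n_tgt) * rbf_kernel(X_src, X_tgt, gamma).sum(axis=1)

    # Step size 1 / (largest eigenvalue of K) keeps the iteration stable.
    step = 1.0 / np.linalg.eigvalsh(K)[-1]
    beta = np.ones(n_src)
    for _ in range(n_iter):
        grad = K @ beta - kappa
        beta = np.clip(beta - step * grad, 0.0, B)  # project onto the box
    return beta
```

The resulting weights could then be supplied as per-instance weights to an SVM trainer that supports them, for example through the `sample_weight` argument of scikit-learn's `SVC.fit`. Source instances that resemble the target data receive larger weights, and those far from it receive weights near zero.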