A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

Ensemble methods, which combine several classifiers, have been successfully applied to decrease the generalization error of machine learning methods. In most ensemble methods the members are combined by a weighted summation of their outputs, called the linear average predictor. The logarithmic opinion pool ensemble method instead uses a multiplicative combination of the ensemble members, which treats the outputs of the members as independent probabilities. The advantage of the logarithmic opinion pool is its connection to the Kullback-Leibler error function, which can be decomposed into two terms: the average error of the ensemble members, and the ambiguity. The ambiguity is independent of the target function and can therefore be estimated using unlabeled data. The advantage of this decomposition is that an unbiased estimate of the generalization error of the ensemble can be obtained while still training on the full training set. These properties can be used to improve classification. The logarithmic opinion pool ensemble method is tested on the prediction of protein secondary structure. The focus is on how much improvement the general ensemble method can give, rather than on outperforming existing methods, since that typically involves several additional steps of refinement.
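The multiplicative combination and the error decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the two-member example, and the class distributions are invented for demonstration. The ensemble distribution is the normalized weighted geometric mean of the member distributions, the ambiguity equals the weighted average KL divergence from the ensemble to the members (which works out to -log Z, involving no target labels), and the ensemble's KL error equals the average member error minus the ambiguity.

```python
import math

def lop_combine(member_probs, weights):
    """Logarithmic opinion pool: normalized weighted geometric mean.

    member_probs: list of per-member class distributions f_i (each sums to 1)
    weights:      coefficients alpha_i (non-negative, summing to 1)
    Returns the ensemble distribution F and the normalizer Z.
    """
    n_classes = len(member_probs[0])
    unnorm = [
        math.exp(sum(a * math.log(f[c]) for f, a in zip(member_probs, weights)))
        for c in range(n_classes)
    ]
    Z = sum(unnorm)
    return [u / Z for u in unnorm], Z

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

# Two hypothetical ensemble members predicting over three classes.
f1 = [0.7, 0.2, 0.1]
f2 = [0.5, 0.3, 0.2]
alphas = [0.5, 0.5]
F, Z = lop_combine([f1, f2], alphas)

# Ambiguity A = sum_i alpha_i * KL(F || f_i) = -log Z. It involves only
# the member outputs, so it can be estimated from unlabeled data.
ambiguity = sum(a * kl(F, f) for f, a in zip([f1, f2], alphas))

# Decomposition: ensemble error = average member error - ambiguity,
# with errors measured as KL divergence from the target distribution t.
t = [1.0, 0.0, 0.0]
ensemble_error = kl(t, F)
mean_member_error = sum(a * kl(t, f) for f, a in zip([f1, f2], alphas))
```

Because the ambiguity term is non-negative, the ensemble's KL error is never worse than the weighted average error of its members; the more the members disagree, the larger the ambiguity and the larger the gain from combining them.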