Emerging quantitative measurement technologies are producing genome-, transcriptome- and proteome-wide data collections, which has motivated the development of data integration methods within an inferential framework. It has been demonstrated that, for certain prediction tasks within computational biology, synergistic improvements in performance can be obtained by integrating a number of (possibly heterogeneous) data sources. In [1] six different parameter representations of proteins were employed for protein fold recognition using Support Vector Machines (SVMs). It was observed that certain dataset combinations provided increased accuracy over the use of any single dataset. Likewise, in [2] a comprehensive experimental study observed improvements in SVM-based gene function prediction when data from both microarray expression and phylogenetic profiles were combined. More recently, protein network inference was shown to improve when various genomic data sources were integrated [3]. In [4] it was shown that superior prediction accuracy for protein-protein interactions was obtainable when a number of diverse data types were combined in an SVM. Whilst all of these papers exploited the kernel method [5] as a means of data fusion within SVM-based classifiers, it was only in [6] that a means of estimating an optimal linear combination of the kernel functions was presented, using semi-definite programming. However, the methods developed in [6] are based on binary SVMs, whilst arguably the majority of classification problems within computational biology are inherently multiclass, and it is unclear how this approach could be extended to discrimination over multiple classes. In addition, the SVM is non-probabilistic, and whilst post hoc methods for obtaining predictive probabilities are available [7], these are not without problems such as overfitting.
On the other hand, Gaussian Process (GP) methods [8] for classification provide a very natural way to both integrate and infer optimal combinations of multiple heterogeneous datasets via composite covariance functions within the Bayesian framework. In this paper it is shown that GPs can be employed on large-scale bioinformatics problems where there are multiple data sources, and an example of protein fold prediction [1] is provided.
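To make the idea of a composite covariance function concrete, the following is a minimal sketch, assuming two hypothetical feature representations of the same proteins and a simple non-negative weighted sum of RBF kernels. The function names, feature dimensions, lengthscales and weights are illustrative assumptions and are not taken from the paper; in particular, the weights are fixed by hand here, whereas the approach described above infers the combination within the Bayesian framework.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential (RBF) kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def composite_kernel(kernels, weights):
    """Non-negative weighted sum of kernel matrices (itself a valid kernel)."""
    weights = np.asarray(weights, dtype=float)
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative to keep the sum PSD")
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical data: two feature representations of the same five proteins.
rng = np.random.default_rng(0)
n_proteins = 5
X_source_a = rng.normal(size=(n_proteins, 20))  # assumed 20-dim representation
X_source_b = rng.normal(size=(n_proteins, 21))  # assumed 21-dim representation

kernels = [rbf_kernel(X_source_a, lengthscale=4.0),
           rbf_kernel(X_source_b, lengthscale=4.0)]

# Illustrative fixed weights; these are NOT inferred from data in this sketch.
K = composite_kernel(kernels, weights=[0.7, 0.3])
```

Because each data source contributes its own kernel matrix over the same set of proteins, sources with different dimensionality or data type can be fused without concatenating their features.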
[1] Jason Weston, et al. Learning Gene Functional Classifications from Multiple Data Types. J. Comput. Biol., 2002.
[2] Nello Cristianini, et al. A statistical framework for genomic data fusion. Bioinform., 2004.
[3] Mark Girolami, et al. Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors. Neural Computation, 2006.
[4] Yoshihiro Yamanishi, et al. Protein network inference from multiple genomic data: a supervised approach. ISMB/ECCB, 2004.
[5] Nello Cristianini, et al. Kernel Methods for Pattern Analysis. ICTAI, 2003.
[6] William Stafford Noble, et al. Kernel methods for predicting protein-protein interactions. ISMB, 2005.
[7] Chris H. Q. Ding, et al. Multi-class protein fold recognition using support vector machines and neural networks. Bioinform., 2001.