A co‐training algorithm for multi‐view data with applications in data fusion

In many scientific applications, data are generated from two or more diverse sources (views) with the goal of predicting an outcome of interest. Often no single view is strongly associated with the outcome on its own; combining the measurements from all views, however, may yield a more predictive classifier. For example, consider a drug discovery application in which individual molecules are described partly by several assay screens based on diverse profiles and partly by their chemical structural fingerprints. A common classification problem is to determine whether a molecule is associated with a particular disease. In this paper, a co-training algorithm is developed that utilizes data from diverse sources to predict the common class variable. Novel enhancements, including variable importance measures, robustness to a mislabeled class variable, and a technique for handling unbalanced classes, are applied to the motivating data set, showing that the approach attains strong predictive performance and provides useful diagnostics for data-analytic purposes. The approach is also compared on real data with a data-fusion framework based on partial least squares (PLS). An R package implementing the proposed approach is provided as Supporting Information. Copyright © 2003 John Wiley & Sons, Ltd.
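To make the co-training idea concrete, the following is a minimal sketch in the classic Blum–Mitchell style, not the paper's exact algorithm: each view trains its own classifier on the labeled examples, and the most confidently predicted unlabeled examples from one view are added to the labeled pool used by the other. The nearest-centroid base learner, the confidence score, and all function names here are illustrative assumptions.

```python
# Hypothetical co-training sketch over two views; the base learner
# (nearest centroid) and confidence heuristic are placeholder choices.
import numpy as np

def centroid_fit(X, y):
    # Store one centroid per class label.
    classes = np.unique(y)
    cents = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, cents

def centroid_predict(model, X):
    # Predict the nearest class centroid; confidence decays with distance.
    classes, cents = model
    d = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)], 1.0 / (1.0 + d.min(axis=1))

def co_train(views, y, labeled, n_rounds=10, k=2):
    """Iteratively grow the labeled set: each view labels its k most
    confident unlabeled examples, which the other view then trains on."""
    y = np.asarray(y).copy()
    labeled = set(labeled)
    for _ in range(n_rounds):
        progressed = False
        for X in views:
            unlabeled = sorted(set(range(len(y))) - labeled)
            if not unlabeled:
                return y
            lab = sorted(labeled)
            model = centroid_fit(X[lab], y[lab])
            pred, conf = centroid_predict(model, X[unlabeled])
            for j in conf.argsort()[::-1][:k]:   # most confident picks
                y[unlabeled[j]] = pred[j]
                labeled.add(unlabeled[j])
                progressed = True
        if not progressed:
            break
    return y
```

On a toy two-view problem where each view is separately separable, starting from one labeled example per class, this loop propagates labels to the full data set within a couple of rounds.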
