Combining multiple views: Case studies on protein and arrhythmia features

Computational annotation of protein functions and structures from sequence features and prediction of certain diseases from gene expression levels are among the important applications of computational biology. Developing methods capable of such predictions is not only important for its biological and medical uses but also a very challenging pattern recognition task because of high input dimensionality and small sample size. Ensemble and multi-view learning have gained popularity with the rapid rise of datasets with large numbers of variables, such as the protein and arrhythmia datasets used in this paper. However, the classical ensemble approach does not take into account conditional interdependences among the views. In this paper, we present a two-stage supervised multi-view learning technique called parallel interacting multi-view learning (PIML). In the first stage of PIML, as in the ensemble method, each view is used individually by a predictor, and class posterior probability estimates are obtained. In the second stage, a predictor for each view is trained on that view's own features together with the class posterior probability estimates of the other views, which serve as summary information about those views. This is a hybrid way of combining the views, in which the views influence each other during training through the predictions of the others, so that interdependences among the views are taken into account. PIML is demonstrated and compared with the classical ensemble approach on three real datasets.
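The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base predictor or the rule used to merge the second-stage outputs, so the nearest-centroid learner with a softmax over negative squared distances, the averaging of second-stage posteriors, and all function names (`fit_centroids`, `posteriors`, `augment`, `piml_fit`, `piml_predict`) are assumptions made for illustration. For brevity the first-stage posteriors are computed on the training data itself; a practical version would more likely use held-out or cross-validated estimates.

```python
import math

def fit_centroids(X, y):
    """Assumed base learner: nearest centroid, one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    # Keys are inserted in sorted order so class columns are consistent.
    return {c: [s / counts[c] for s in sums[c]] for c in sorted(sums)}

def posteriors(model, X):
    """Stand-in posterior estimate: softmax over negative squared distances."""
    out = []
    for row in X:
        scores = [-sum((a - b) ** 2 for a, b in zip(row, mu))
                  for mu in model.values()]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def augment(views, probs, k):
    """Append the posteriors of every view except k to view k's features."""
    rows = []
    for i, row in enumerate(views[k]):
        extra = [p for j in range(len(views)) if j != k for p in probs[j][i]]
        rows.append(list(row) + extra)
    return rows

def piml_fit(views, y):
    # Stage 1: one predictor per view, trained on that view alone.
    stage1 = [fit_centroids(V, y) for V in views]
    probs = [posteriors(m, V) for m, V in zip(stage1, views)]
    # Stage 2: retrain each view on its features plus the other views' posteriors.
    stage2 = [fit_centroids(augment(views, probs, k), y)
              for k in range(len(views))]
    return stage1, stage2

def piml_predict(models, views):
    stage1, stage2 = models
    probs = [posteriors(m, V) for m, V in zip(stage1, views)]
    final = [posteriors(stage2[k], augment(views, probs, k))
             for k in range(len(views))]
    classes = sorted(stage1[0])
    preds = []
    for i in range(len(views[0])):
        # Assumed combination rule: average the second-stage posteriors.
        avg = [sum(f[i][c] for f in final) / len(final)
               for c in range(len(classes))]
        preds.append(classes[avg.index(max(avg))])
    return preds

# Toy data: two views of the same four samples, two classes.
view_a = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
view_b = [[5.0], [5.2], [1.0], [0.8]]
labels = [0, 0, 1, 1]

models = piml_fit([view_a, view_b], labels)
predictions = piml_predict(models, [view_a, view_b])
```

A classical ensemble in this setting would stop after the first stage and merge `probs` directly; the second stage is where each view's predictor gets to see a compact summary of what the other views predict.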
