Multi-source Hierarchical Prediction Consolidation

In big data applications such as healthcare data mining, privacy concerns often require collecting predictions from multiple information sources for the same instance, with the raw features being discarded or withheld during aggregation. In addition, crowdsourced labels need to be aggregated to estimate the ground truth of the data. Because predictive models and human crowdsourcing workers are imperfect, noisy and conflicting information is ubiquitous and inevitable. State-of-the-art aggregation methods handle label spaces with flat structures, but as label spaces grow more complex, aggregation under a hierarchical label structure becomes necessary, and this setting has been largely ignored. Such label hierarchies can be highly informative: they are usually created by domain experts to capture complex label correlations, such as protein functionality interactions or disease relationships. We propose a novel multi-source hierarchical prediction consolidation method that effectively exploits the hierarchical label structure to resolve the noisy and conflicting information inherent in multiple imperfect sources. We formulate the problem as an optimization problem with a closed-form solution, and the consolidation result is inferred in a fully unsupervised, iterative fashion. Experimental results on both synthetic and real-world data sets show the effectiveness of the proposed method over existing alternatives.
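As a rough illustration of the kind of consolidation the abstract describes, the sketch below averages predictions from several imperfect sources and then smooths the result along a label hierarchy via graph-Laplacian regularization, which admits a closed-form solution. This is an assumed toy reconstruction, not the paper's actual formulation; the hierarchy, the smoothness weight `mu`, and the update rule are all illustrative assumptions.

```python
import numpy as np

# Toy label hierarchy over 4 labels: 0 is the root, 1 and 2 are its
# children, and 3 is a child of 1. (Assumed structure for illustration.)
edges = [(0, 1), (0, 2), (1, 3)]
n_labels = 4

# Adjacency matrix and unnormalized graph Laplacian of the hierarchy.
A = np.zeros((n_labels, n_labels))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Predictions for one instance from 3 imperfect sources (one row each).
preds = np.array([
    [0.9, 0.8, 0.1, 0.7],
    [1.0, 0.6, 0.2, 0.9],
    [0.8, 0.9, 0.3, 0.6],
])

mu = 0.5                      # smoothness weight (assumed hyperparameter)
y_bar = preds.mean(axis=0)    # naive per-label average across sources

# Closed-form minimizer of ||y - y_bar||^2 + mu * y^T L y:
#   y* = (I + mu * L)^{-1} y_bar
# i.e., the averaged prediction pulled toward agreement along hierarchy edges.
y_star = np.linalg.solve(np.eye(n_labels) + mu * L, y_bar)

print(np.round(y_star, 3))
```

In a full method one would typically also re-estimate per-source reliability weights and alternate the two steps until convergence, which is consistent with the iterative, unsupervised inference the abstract mentions.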
