Quality of information-based source assessment and selection

Multiple information sources describing the same set of objects can provide different representations, and combining their strengths may improve predictive power for a given task. However, some sources may be irrelevant or redundant. It is therefore worthwhile to select a set of good information sources that improves learning performance, yet very little work has been reported on this topic. In this paper, we first identify two aspects of quality of information: source significance and source redundancy. Significance is the degree to which an information source contributes to classification, and redundancy is the information overlap among different information sources. We then propose a metric that combines neighborhood mutual information with a Max-Significance-Min-Redundancy algorithm, allowing us to select a compact set of superior information sources for classification learning. Extensive experiments show that the metric helps identify good information sources and that the proposed method outperforms many other methods.
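
To make the selection idea concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact metric, so several choices here are assumptions: neighborhoods are built with a Chebyshev distance and a hypothetical radius `delta`, neighborhood mutual information follows its usual definition from the neighborhood rough set literature, and a candidate source is scored as its significance minus its mean redundancy with the sources already chosen (an mRMR-style difference criterion). The names `neighborhoods`, `label_neighborhoods`, `nmi`, and `select_sources` are illustrative only.

```python
import numpy as np


def neighborhoods(X, delta):
    """Boolean n x n matrix; entry (i, j) is True when sample j lies within
    Chebyshev distance `delta` of sample i in the feature space of one source."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D array as a single-feature source
    dist = np.abs(X[:, None, :] - X[None, :, :]).max(axis=2)
    return dist <= delta


def label_neighborhoods(y):
    """For discrete class labels, the neighborhood of a sample is its class."""
    y = np.asarray(y)
    return y[:, None] == y[None, :]


def nmi(A, B):
    """Neighborhood mutual information between two neighborhood relations.
    Assumption: the joint neighborhood is the intersection of the two, which
    matches a Chebyshev-distance neighborhood on the concatenated features."""
    n = A.shape[0]
    size_a = A.sum(axis=1)
    size_b = B.sum(axis=1)
    size_ab = (A & B).sum(axis=1)  # always >= 1: each sample is its own neighbor
    return float(-np.mean(np.log(size_a * size_b / (n * size_ab))))


def select_sources(sources, y, k, delta=0.15):
    """Greedy Max-Significance-Min-Redundancy selection of up to k sources.
    Assumed score: NMI(source; labels) - mean NMI(source; selected sources)."""
    hoods = [neighborhoods(X, delta) for X in sources]
    labels = label_neighborhoods(y)
    significance = [nmi(H, labels) for H in hoods]

    # Start from the single most significant source.
    selected = [int(np.argmax(significance))]
    while len(selected) < min(k, len(sources)):
        best, best_score = None, -np.inf
        for i in range(len(sources)):
            if i in selected:
                continue
            redundancy = np.mean([nmi(hoods[i], hoods[j]) for j in selected])
            score = significance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy data: four classes, three single-feature sources.
y = [0, 0, 1, 1, 2, 2, 3, 3]
source_a = [0, 0, 0, 0, 1, 1, 1, 1]  # separates classes {0, 1} from {2, 3}
source_b = list(source_a)            # exact duplicate of source_a
source_c = [0, 0, 1, 1, 0, 0, 1, 1]  # separates classes {0, 2} from {1, 3}
print(select_sources([source_a, source_b, source_c], y, k=2))  # → [0, 2]
```

In this toy example all three sources are equally significant on their own, but the duplicate adds nothing once `source_a` is chosen, so the redundancy term steers the second pick to `source_c`. A ratio-based score or a different distance would be equally plausible under the abstract's description.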
