Data heterogeneity consideration in semi-supervised learning

The heterogeneous nature of data is considered in the machine learning context. An adaptive data graph construction method is proposed. The identification of representative data is studied.

In the class (cluster) formation process of machine learning techniques, data instances are usually assumed to be equally relevant. However, this assumption frequently does not hold. The issue is especially pronounced in semi-supervised learning, since the structure of both labeled and unlabeled data must be understood at the same time. In this paper, we investigate the organizational heterogeneity of data in semi-supervised learning using a graph representation. A graph is a natural choice for characterizing the relationship between any pair of nodes or any pair of groups of nodes; consequently, the strategic location of each node or group of nodes can be determined by graph measures. Specifically, two issues are addressed: (1) We propose an adaptive graph construction method, which we call AdaRadius, that takes into account the heterogeneity of the local interaction structure among nodes. As a result, it has several useful properties, namely adaptability to variations in data density, low dependency on parameter settings, and reasonable computational cost, for both pool-based and incremental data. (2) We present heuristic criteria for selecting representative data samples to be labeled. Experiments show that selective labeling usually yields better classification results than random labeling. To our knowledge, both issues have so far received little investigation; our approach is therefore an important step toward characterizing data heterogeneity, not only in semi-supervised learning but also in machine learning in general.
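The two ideas above, a density-adaptive neighborhood graph and the selection of representative nodes by a graph measure, can be illustrated with a minimal sketch. This is not the AdaRadius algorithm, whose details are not given here. As a stand-in assumption, each point's radius is set to the distance to its k-th nearest neighbor, and node degree is used as a hypothetical selection criterion. The function names `adaptive_radius_graph` and `select_by_degree` and the parameter `k` are illustrative only.

```python
import numpy as np


def adaptive_radius_graph(X, k=3):
    """Boolean adjacency matrix of a graph with a per-point radius.

    Assumption (not the paper's method): the radius of point i is the
    distance to its k-th nearest neighbor, so dense regions get small
    radii and sparse regions get large ones. Requires len(X) > k.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    radius = np.sort(D, axis=1)[:, k]
    # Connect i to j if j lies within i's radius, then symmetrize.
    A = D <= radius[:, None]
    A = A | A.T
    np.fill_diagonal(A, False)
    return A


def select_by_degree(A, n_select):
    """Indices of the n_select highest-degree nodes (ties keep index order)."""
    degree = A.sum(axis=1)
    order = np.argsort(-degree, kind="stable")
    return order[:n_select]


# Toy data: three nearby points and one distant point on a line.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
A_toy = adaptive_radius_graph(X_toy, k=1)
chosen = select_by_degree(A_toy, n_select=2)
```

In this toy case the distant point still receives an edge, because its own radius grows to reach its nearest neighbor; a fixed global radius small enough for the dense group would have left it isolated. Degree is only one of many possible graph measures for ranking nodes.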
