Data mining using concepts of independence, unimodality and homophily

With the widespread use of information technologies, ever more complex data is generated and collected every day. Such data varies in structure, size, type and format, e.g. time series, texts, images, videos and graphs. Complex data is often high-dimensional and heterogeneous, which makes separating the wheat (knowledge) from the chaff (noise) more difficult. Clustering is a principal mode of knowledge discovery from complex data: it groups objects so that objects within a group are more similar to each other than to objects in other groups. Traditional clustering methods such as k-means, Expectation-Maximization (EM) clustering, DBSCAN and spectral clustering are either deceived by the "curse of dimensionality" or spoiled by heterogeneous information. So how can complex data be explored effectively?

In some cases, only partial information about the data is available. In social networks, for example, not every user provides profile information such as personal interests. Can we leverage the limited user information and the friendship network to infer the likely labels of the unlabeled users, so that advertisers can target their advertising accurately? This is the problem of learning from labeled and unlabeled data, commonly referred to as semi-supervised classification.

To gain insights into these problems, this thesis focuses on developing clustering and semi-supervised classification methods driven by the concepts of independence, unimodality and homophily. The proposed methods leverage techniques from diverse areas such as statistics, information theory, graph theory, signal processing, optimization and machine learning. Specifically, this thesis develops four methods: FUSE, ISAAC, UNCut and wvGN. FUSE and ISAAC are clustering techniques that discover statistically independent patterns in high-dimensional numerical data. UNCut is a clustering technique that discovers unimodal clusters in attributed graphs in which not all attributes are relevant to the graph structure. wvGN is a semi-supervised classification technique that exploits homophily to infer the labels of unlabeled vertices in graphs; a minimal sketch of this idea follows the abstract. We have evaluated our clustering and semi-supervised classification methods on a variety of synthetic and real-world data sets; the results are superior to those of the state of the art.
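The social-network question above can be made concrete with a small sketch. The following Python snippet is an illustration only, not the wvGN method developed in the thesis: it shows how homophily, i.e. the tendency of connected vertices to share labels, lets a handful of known labels propagate through a toy friendship graph. The graph, the labels and every name in the code are hypothetical.

```python
# Minimal homophily-based label propagation on a toy friendship graph.
# Illustrative sketch only, NOT the wvGN algorithm; the graph, labels
# and class names below are made up for demonstration.
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (2, 6), (6, 7)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

labels = {0: "sports", 7: "music"}          # the few known user interests
classes = sorted(set(labels.values()))
k = len(classes)

# Class-score vectors: labeled vertices are one-hot, the rest start uniform.
scores = {v: ([1.0 if labels[v] == c else 0.0 for c in classes]
              if v in labels else [1.0 / k] * k)
          for v in adj}

for _ in range(20):                          # iterate until roughly stable
    updated = {}
    for v in adj:
        if v in labels:                      # clamp vertices with known labels
            updated[v] = scores[v]
            continue
        agg = [sum(scores[u][i] for u in adj[v]) for i in range(k)]
        total = sum(agg) or 1.0
        updated[v] = [a / total for a in agg]
    scores = updated

predicted = {v: classes[scores[v].index(max(scores[v]))]
             for v in adj if v not in labels}
print(predicted)   # unlabeled users inherit the dominant label of their region
```

Clamping the labeled vertices while repeatedly averaging neighbor scores is the design idea behind many graph-based semi-supervised classifiers; the thesis methods build on the same homophily assumption but differ in how the propagation is formulated and optimized.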
