Descriptive Document Clustering via Discriminant Learning in a Co-embedded Space of Multi-level Similarities

Descriptive document clustering aims to discover clusters of semantically interrelated documents together with meaningful labels that summarise the content of each cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of two types of heterogeneous objects, corresponding to documents and candidate phrases, using multi-level similarity information. CEDL is composed of five main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbour-based proximities between the combined sets of documents and phrases. Second, it discovers an approximate cluster structure of documents in the common space. Third, it extracts promising topic phrases by constructing a discriminant model in which documents, along with their cluster memberships, are used as training instances. Fourth, the final cluster labels are selected from the topic phrases using a ranking scheme that combines multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multi-topic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively through experiments on document databases from different application fields.
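The five stages can be illustrated with a minimal sketch. This is not the CEDL implementation: every component is a simple stand-in chosen for brevity. A truncated SVD replaces the higher-order neighbour-based co-embedding, k-means with farthest-point seeding replaces the approximate clustering step, a one-vs-rest ridge regression replaces the discriminant model, and a sum of discriminant weight and cosine similarity replaces the multi-score ranking scheme. All function names, parameters (`dim`, `n_candidates`, `lam`, `tau`), and the toy data are hypothetical.

```python
import numpy as np

def co_embed(X, dim):
    # Stage 1 (stand-in): truncated SVD places documents (rows of X) and
    # candidate phrases (columns of X) in one shared space.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scale = np.sqrt(s[:dim])
    return U[:, :dim] * scale, Vt[:dim].T * scale

def kmeans(Z, k, iters=20):
    # Stage 2 (stand-in): k-means with deterministic farthest-point seeding.
    centres = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Z[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        dist = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = Z[labels == j].mean(axis=0)
    return labels, centres

def cosine(a, B):
    # Cosine similarity between vector a and every row of B.
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

def cedl_sketch(X, phrases, k, dim=2, n_candidates=2, lam=0.1, tau=0.5):
    X = np.asarray(X, dtype=float)
    docs, phr = co_embed(X, dim)                      # stage 1
    labels, centres = kmeans(docs, k)                 # stage 2
    # Stage 3 (stand-in): one-vs-rest ridge discriminant trained on the
    # documents and their cluster memberships; W[p, j] weights phrase p
    # for cluster j.
    Y = np.where(labels[:, None] == np.arange(k)[None, :], 1.0, -1.0)
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    names, label_vecs = [], []
    for j in range(k):
        cand = np.argsort(-W[:, j])[:n_candidates]    # topic phrases
        # Stage 4 (stand-in): rank by discriminant weight plus proximity
        # of the phrase to the cluster centre in the co-embedded space.
        score = W[cand, j] + cosine(centres[j], phr[cand])
        best = cand[np.argmax(score)]
        names.append(phrases[best])
        label_vecs.append(phr[best])
    # Stage 5 (stand-in): a document joins every cluster whose label phrase
    # lies close to it, which permits multi-topic membership.
    membership = np.stack([cosine(v, docs) for v in label_vecs], axis=1) >= tau
    return labels, names, membership

# Toy document-phrase count matrix with two topical blocks (made-up data).
phrases = ["neural network", "deep learning", "gradient descent",
           "gene expression", "protein folding", "dna sequencing"]
X = [[3, 1, 1, 0, 0, 0],
     [1, 3, 0, 0, 0, 0],
     [1, 1, 3, 0, 0, 0],
     [0, 0, 0, 2, 1, 0],
     [0, 0, 0, 1, 2, 1],
     [0, 0, 0, 0, 1, 2]]
labels, names, membership = cedl_sketch(X, phrases, k=2)
print(labels, names)
print(membership)
```

On this toy input the two blocks of documents fall into separate clusters, and each cluster is labelled with a phrase drawn from its own block. The value of the sketch is the data flow: the co-embedding produced in the first stage is reused for clustering, for ranking candidate labels, and for the final polishing of memberships.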
