Fast and reliable inference of semantic clusters

Document indexing includes, but is not limited to, summarizing document contents with a small set of keywords or concepts from a knowledge base. Such a compact representation eases the use of documents in numerous processes such as content-based information retrieval, corpus mining and classification. Considerable effort has been devoted in recent years to (partly) automating semantic indexing, i.e. associating concepts with documents, which has led to the availability of large corpora of semantically indexed documents. In this paper we introduce a method that hierarchically clusters documents based on their semantic indices while providing the proposed clusters with semantic labels. Our approach follows a neighbor-joining strategy. Starting from a distance matrix reflecting the semantic similarity of documents, it iteratively selects the two closest clusters and merges them into a larger one. The similarity matrix is then updated. This is usually done by combining the similarities of the two merged clusters, e.g. by averaging them. We propose an alternative approach in which the new cluster is first semantically annotated, and the similarity matrix is then updated using the semantic similarity between this new annotation and the annotations of the remaining clusters. The resulting hierarchical clustering is a binary tree whose branch lengths convey the semantic distances between clusters. It is then post-processed, using the branch lengths to keep only the most relevant clusters. Such a tool has numerous practical applications, as it automates the organization of documents into meaningful clusters (e.g. papers indexed by MeSH terms, or bookmarks and pictures indexed by WordNet), which is a tedious everyday task for many people. We assess the quality of the proposed method using a dedicated benchmark of manually built, annotated clusters of bookmarks. Each dataset of this benchmark has been clustered independently by several users. Remarkably, the clusters automatically built by our method are congruent with the clusters proposed by experts. All resources of this work, including source code, jar file, benchmark files and results, are available at this address: http://sc.nicolasfiorini.info.
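
To make the merge-then-annotate update concrete, the following is a minimal Python sketch written under strong simplifying assumptions; it is not the implementation distributed at the address above. The taxonomy `PARENT`, the single concept per document, the use of the most specific common ancestor as the label of a merged cluster, and the Wu-Palmer-style similarity `sim` are all hypothetical stand-ins chosen for illustration. The sketch also merges the most similar pair directly, without the correction term of full neighbor joining, and it omits branch lengths and post-processing.

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "bike": "vehicle",
}

def ancestors(concept):
    """Return the path from a concept up to the root, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def lca(a, b):
    """Most specific common ancestor of two concepts."""
    anc_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in anc_a)

def depth(concept):
    """Depth in the taxonomy; the root has depth 1."""
    return len(ancestors(concept))

def sim(a, b):
    """Wu-Palmer-style similarity (an assumed measure), in (0, 1]."""
    return 2 * depth(lca(a, b)) / (depth(a) + depth(b))

def cluster(docs):
    """Build a labelled binary tree from {document name: concept}.

    Leaves are (concept, name); internal nodes are (label, left, right).
    """
    active = [(concept, name) for name, concept in docs.items()]
    while len(active) > 1:
        # Select the two clusters whose labels are most similar.
        i, j = max(
            combinations(range(len(active)), 2),
            key=lambda p: sim(active[p[0]][0], active[p[1]][0]),
        )
        # Annotate the merged cluster first (assumed labelling rule: LCA).
        label = lca(active[i][0], active[j][0])
        merged = (label, active[i], active[j])
        # Later similarities are computed from the new label itself,
        # not by averaging the similarities of the two merged clusters.
        active = [t for k, t in enumerate(active) if k not in (i, j)]
        active.append(merged)
    return active[0]

tree = cluster({"d1": "dog", "d2": "cat", "d3": "car", "d4": "bike"})
print(tree)
```

Because every similarity is recomputed from cluster labels, each internal node of the resulting tree carries a label as a by-product of the clustering, rather than needing a separate labelling pass afterwards.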
