Paper information - Fast and reliable inference of semantic clusters

Fast and reliable inference of semantic clusters

Document indexing includes, but is not limited to, summarizing document contents with a small set of keywords or concepts from a knowledge base. Such a compact representation of document contents eases their use in numerous processes such as content-based information retrieval, corpus mining and classification. Considerable effort has been devoted in recent years to (partly) automating semantic indexing, i.e. associating concepts with documents, leading to the availability of large corpora of semantically indexed documents. In this paper we introduce a method that hierarchically clusters documents based on their semantic indices while providing the proposed clusters with semantic labels. Our approach follows a neighbor-joining strategy. Starting from a distance matrix reflecting the semantic similarity of documents, it iteratively selects the two closest clusters and merges them into a larger one. The similarity matrix is then updated. This is usually done by combining the similarities of the two merged clusters, e.g. using the average similarity. We propose an alternative approach in which the new cluster is first semantically annotated, and the similarity matrix is then updated using the semantic similarity of this new annotation with those of the remaining clusters. The resulting hierarchical clustering is a binary tree whose branch lengths convey the semantic distances between clusters. It is then post-processed, using the branch lengths to keep only the most relevant clusters. Such a tool has numerous practical applications, as it automates the organization of documents into meaningful clusters (e.g. papers indexed by MeSH terms, bookmarks or pictures indexed by WordNet), which is a tedious everyday task for many people. We assess the quality of the proposed methods using a specific benchmark of annotated clusters of bookmarks that were built manually. Each dataset of this benchmark has been clustered independently by several users. Remarkably, the clusters automatically built by our method are congruent with the clusters proposed by experts. All resources of this work, including source code, jar file, benchmark files and results, are available at this address: http://sc.nicolasfiorini.info.
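
The merge-then-annotate loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy `PARENT`, the Jaccard-based `similarity`, and the `annotate` rule (keep the concepts shared by both merged clusters) are all assumptions chosen for brevity. The sketch also recomputes pairwise similarities on the fly instead of maintaining a matrix, and it omits branch lengths and the post-processing step.

```python
from itertools import combinations

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "bike": "vehicle",
}

def closure(concepts):
    """Return the given concepts together with all of their ancestors."""
    out = set()
    for c in concepts:
        while c is not None:
            out.add(c)
            c = PARENT[c]
    return frozenset(out)

def similarity(a, b):
    """Jaccard index of two ancestor-closed annotations (assumed measure)."""
    return len(a & b) / len(a | b)

def annotate(a, b):
    """Assumed labeling rule: keep the concepts shared by both children."""
    return a & b

def cluster(docs):
    """Repeatedly merge the two most similar clusters into a binary tree.

    Leaves are document names; inner nodes are (left, right, sorted label).
    """
    # Each active cluster maps its name to (annotation, subtree).
    active = {name: (closure(c), name) for name, c in docs.items()}
    while len(active) > 1:
        # Pick the pair of clusters whose annotations are most similar.
        x, y = max(
            combinations(sorted(active), 2),
            key=lambda p: similarity(active[p[0]][0], active[p[1]][0]),
        )
        (ax, tx), (ay, ty) = active.pop(x), active.pop(y)
        # Annotate the new cluster first; later similarities use this label.
        label = annotate(ax, ay)
        active[f"({x},{y})"] = (label, (tx, ty, sorted(label)))
    return next(iter(active.values()))[1]

tree = cluster({"d1": {"dog"}, "d2": {"cat"}, "d3": {"car"}, "d4": {"bike"}})
```

In this toy run the two animal documents are merged and labeled before the two vehicle documents, and the root ends up labeled with the only concept all four share. Because later merges compare labels rather than averaged document similarities, each inner node of the tree carries a readable annotation.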
