Paper Information - GDClust: A Graph-Based Document Clustering Technique

GDClust: A Graph-Based Document Clustering Technique

This paper introduces a new technique for document clustering based on frequent senses. The proposed system, GDClust (Graph-based Document Clustering), works with frequent senses rather than the frequent keywords used in traditional text mining techniques. GDClust represents text documents as hierarchical document-graphs and uses an Apriori paradigm to find frequent subgraphs, which reflect frequent senses. The discovered frequent subgraphs are then used to generate sense-based document clusters. We propose a novel multilevel Gaussian minimum support approach for candidate subgraph generation. GDClust uses an English-language ontology to construct document-graphs and applies graph-based data mining techniques for sense discovery and clustering. The system is automated and requires minimal human interaction for clustering.
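The level-wise search described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several simplifying assumptions: each document-graph is modeled as a set of labeled edges between ontology concepts, a "subgraph" is any set of such edges (connectivity is not checked), a single fixed minimum support stands in for the paper's multilevel Gaussian scheme (whose formula the abstract does not give), and documents are compared by Jaccard similarity over the frequent edge sets they contain. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def support(edge_set, doc_graphs):
    """Count the document-graphs that contain every edge in edge_set."""
    return sum(1 for g in doc_graphs if edge_set <= g)


def frequent_edge_sets(doc_graphs, min_support):
    """Apriori-style, level-wise search for frequent edge sets.

    Assumption: a fixed min_support replaces the paper's multilevel
    Gaussian minimum support, and connectivity is not enforced.
    """
    all_edges = set().union(*doc_graphs)
    # Level 1: single edges that occur in enough documents.
    current = {frozenset([e]) for e in all_edges
               if support(frozenset([e]), doc_graphs) >= min_support}
    frequent = set(current)
    while current:
        k = len(next(iter(current)))
        # Join step: merge two frequent k-edge sets that differ by one edge.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-edge subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(c - {e} in current for e in c)}
        current = {c for c in candidates
                   if support(c, doc_graphs) >= min_support}
        frequent |= current
    return frequent


def signature(doc_graph, frequent):
    """Frequent edge sets that are contained in one document-graph."""
    return {fs for fs in frequent if fs <= doc_graph}


def jaccard(a, b):
    """Jaccard similarity of two signatures (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy document-graphs: edges are (parent concept, child concept) pairs.
docs = [
    frozenset({("entity", "animal"), ("animal", "dog"), ("animal", "cat")}),
    frozenset({("entity", "animal"), ("animal", "dog")}),
    frozenset({("entity", "artifact"), ("artifact", "car")}),
]
frequent = frequent_edge_sets(docs, min_support=2)
signatures = [signature(d, frequent) for d in docs]
sim_01 = jaccard(signatures[0], signatures[1])
sim_02 = jaccard(signatures[0], signatures[2])
```

In this toy run the first two documents share every frequent edge set while the third shares none, so a similarity-based clustering step would group the first two together. Any standard clustering method could consume these pairwise similarities; the abstract does not say which one GDClust uses.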

Rafal A. Angryk | M. S. Hossain
