GDClust: A Graph-Based Document Clustering Technique

This paper introduces a new document clustering technique based on frequent senses. The proposed system, GDClust (Graph-based Document Clustering), works with frequent senses rather than the frequent keywords used in traditional text mining techniques. GDClust represents text documents as hierarchical document-graphs and uses an Apriori paradigm to find frequent subgraphs, which reflect frequent senses. The discovered frequent subgraphs are then used to generate sense-based document clusters. We propose a novel multilevel Gaussian minimum support approach for candidate subgraph generation. GDClust uses an English-language ontology to construct document-graphs and exploits graph-based data mining techniques for sense discovery and clustering. It is an automated system that requires minimal human interaction for clustering.
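The Apriori-style search for frequent subgraphs can be illustrated with a minimal sketch. It is not GDClust's actual implementation, and it rests on several simplifying assumptions: each document-graph is reduced to a set of labeled edges (child concept, parent concept), a "subgraph" is any edge set without a connectivity check, and a single fixed `min_support` stands in for the multilevel Gaussian minimum support, whose details the abstract does not give. The function name `frequent_edge_sets` and the toy concepts are hypothetical.

```python
from itertools import combinations


def frequent_edge_sets(doc_graphs, min_support, max_size=3):
    """Level-wise (Apriori-style) search for edge sets that occur in at
    least `min_support` document-graphs.

    Assumption: each document-graph is a set of (child, parent) edges
    whose node labels are unique ontology concepts, so subgraph
    containment reduces to a subset test. Connectivity is not checked.
    """
    def support(edge_set):
        # Number of documents whose graph contains every edge in edge_set.
        return sum(1 for g in doc_graphs if edge_set <= g)

    # Level 1: single edges that meet the support threshold.
    all_edges = set().union(*doc_graphs)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e])) >= min_support}

    frequent = {}
    size = 1
    while level and size <= max_size:
        for edge_set in level:
            frequent[edge_set] = support(edge_set)

        # Candidate generation: join two frequent sets that differ by one
        # edge, then prune any candidate with an infrequent subset.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in level for sub in combinations(union, size)
            ):
                candidates.add(union)

        level = {c for c in candidates if support(c) >= min_support}
        size += 1
    return frequent


# Toy document-graphs (hypothetical concepts, not taken from the paper).
docs = [
    {("dog", "canine"), ("canine", "animal")},
    {("dog", "canine"), ("canine", "animal"), ("cat", "feline")},
    {("cat", "feline"), ("feline", "animal")},
]
result = frequent_edge_sets(docs, min_support=2)
for edge_set, count in result.items():
    print(sorted(edge_set), count)
```

Documents that share many of the returned edge sets could then be grouped together. The abstract does not specify the similarity measure or clustering algorithm, so that step is left out of the sketch.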
