Phrase-based document similarity based on an index graph model

Most document clustering techniques rely on single-term analysis of the document data set, as in the vector space model. To better capture the structure of documents, the underlying data model should represent the phrases in a document as well as its single terms. We present a novel data model, the document index graph, which indexes web documents by phrases rather than by single terms only. The semi-structured nature of web documents helps identify potential phrases that, when matched against other documents, indicate strong similarity between them. The document index graph captures this information, so finding significant matching phrases between documents becomes easy and efficient. The similarity between two documents is computed from both single-term weights and matching-phrase weights. The combined similarity is used with standard document clustering techniques to test its effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, significantly improves web document clustering quality.
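To make the idea concrete, the following is a minimal sketch of a phrase index of this kind, not the authors' implementation. It assumes that nodes are words and that edges link consecutive words within a sentence, each edge recording where it occurs in each document. The class and function names, the whitespace tokenizer, the phrase score (matched phrase words over the length of the shorter document), and the blending factor `alpha` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


class DocumentIndexGraph:
    """Toy phrase index (illustrative only): words are nodes, and an edge
    links two consecutive words in a sentence."""

    def __init__(self):
        self.edges = {}      # (word, next_word) -> {doc_id: {(sentence_no, position)}}
        self.sentences = {}  # doc_id -> list of tokenized sentences
        self.terms = {}      # doc_id -> Counter of single terms

    def add_document(self, doc_id, sentences):
        # Assumption: lowercase whitespace tokenization, no stemming.
        tokenized = [s.lower().split() for s in sentences]
        self.sentences[doc_id] = tokenized
        self.terms[doc_id] = Counter(w for words in tokenized for w in words)
        for s_no, words in enumerate(tokenized):
            for pos in range(len(words) - 1):
                postings = self.edges.setdefault((words[pos], words[pos + 1]), {})
                postings.setdefault(doc_id, set()).add((s_no, pos))

    def _hits(self, first, second, doc_id):
        # Positions in doc_id where the edge first -> second occurs.
        return self.edges.get((first, second), {}).get(doc_id, set())

    def shared_phrases(self, doc_a, doc_b):
        """Maximal runs of two or more consecutive words found in both documents."""
        phrases = set()
        for words in self.sentences[doc_a]:
            prev_end = -1
            for i in range(len(words) - 1):
                live = self._hits(words[i], words[i + 1], doc_b)
                if not live:
                    continue
                j = i + 1
                # Extend the run while doc_b continues it at the next position.
                while j < len(words) - 1:
                    nxt = self._hits(words[j], words[j + 1], doc_b)
                    extended = {(s, p + 1) for s, p in live} & nxt
                    if not extended:
                        break
                    live, j = extended, j + 1
                # Skip runs fully contained in an earlier, longer run.
                if j > prev_end:
                    phrases.add(tuple(words[i:j + 1]))
                    prev_end = j
        return phrases


def term_similarity(graph, a, b):
    """Cosine similarity over raw single-term counts."""
    ta, tb = graph.terms[a], graph.terms[b]
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = math.sqrt(sum(c * c for c in ta.values())) * math.sqrt(
        sum(c * c for c in tb.values())
    )
    return dot / norm if norm else 0.0


def phrase_similarity(graph, a, b):
    """Hypothetical phrase score: matched phrase words over the shorter document."""
    matched = sum(len(p) for p in graph.shared_phrases(a, b))
    shorter = min(sum(graph.terms[a].values()), sum(graph.terms[b].values()))
    return min(1.0, matched / shorter) if shorter else 0.0


def combined_similarity(graph, a, b, alpha=0.7):
    """Blend phrase and single-term similarity; alpha is an assumed weight."""
    return alpha * phrase_similarity(graph, a, b) + (1 - alpha) * term_similarity(graph, a, b)


graph = DocumentIndexGraph()
graph.add_document("d1", ["river rafting trips", "wild river adventures"])
graph.add_document("d2", ["river rafting vacation", "mild river rafting trips"])
graph.add_document("d3", ["stock market news"])

print(graph.shared_phrases("d1", "d2"))  # {('river', 'rafting', 'trips')}
print(combined_similarity(graph, "d1", "d2"))
print(combined_similarity(graph, "d1", "d3"))
```

The resulting pairwise scores could then be passed to any standard clustering algorithm that accepts a similarity matrix. Because every edge stores its postings per document, phrase matches are found by following shared edges rather than by comparing the full text of each document pair.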
