Semantic indexing of hybrid frequent pattern-based clustering of documents with missing semantic information

Documents added to the web recently are augmented with semantic information that identifies the class of the document: the topic or concept to which a document belongs can be identified explicitly using meta tags such as keyword tags or Resource Description Framework (RDF) tags. Documents that were added to the web five or ten years ago, however, do not contain such semantic information. In this paper, we present a hybrid clustering system using frequent pattern mining (HCSFPM), a technique that fuses two frequent pattern mining schemes, frequent term-based and frequent pattern-based, to cluster documents according to topics or concepts. We also index the documents based on their semantic information content. Results illustrate that the HCSFPM method performs better than the traditional term-based method.
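
To make the frequent term-based half of the approach concrete, the following is a minimal sketch and not the HCSFPM implementation itself, which the abstract does not specify. It assumes whitespace tokenization, a brute-force enumeration of term sets up to a small size, and a simple assignment rule (each document joins the cluster of the largest frequent term set it contains). The function names `frequent_term_sets` and `cluster_by_frequent_terms` and the parameters `min_support` and `max_size` are hypothetical, chosen only for illustration. The fusion with frequent pattern-based mining and the semantic indexing step are not covered.

```python
from collections import defaultdict
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the ids of the documents containing it.

    A term set is frequent when at least `min_support` documents contain
    all of its terms. Enumeration is brute force (assumption: small input).
    """
    cover = defaultdict(set)
    for doc_id, text in enumerate(docs):
        terms = sorted(set(text.lower().split()))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                cover[frozenset(combo)].add(doc_id)
    return {ts: ids for ts, ids in cover.items() if len(ids) >= min_support}


def cluster_by_frequent_terms(docs, min_support=2, max_size=2):
    """Assign each document to the cluster labeled by its best frequent term set.

    "Best" is an illustrative rule: largest term set first, then highest
    support, then alphabetical order to keep the result deterministic.
    Documents with no frequent term set go to the cluster labeled frozenset().
    """
    frequent = frequent_term_sets(docs, min_support, max_size)
    clusters = defaultdict(list)
    for doc_id in range(len(docs)):
        candidates = [ts for ts, ids in frequent.items() if doc_id in ids]
        if not candidates:
            clusters[frozenset()].append(doc_id)
            continue
        best = max(
            candidates,
            key=lambda ts: (len(ts), len(frequent[ts]), sorted(ts)),
        )
        clusters[best].append(doc_id)
    return dict(clusters)


# Toy corpus (invented for illustration only).
docs = [
    "frequent pattern mining",
    "pattern mining algorithms",
    "semantic web tags",
    "semantic web index",
]
clusters = cluster_by_frequent_terms(docs, min_support=2)
for label, members in clusters.items():
    print(sorted(label), members)
```

The cluster label doubles as a crude topic description, which is the property that makes frequent term sets attractive for documents lacking explicit meta tags. A practical system would replace the brute-force enumeration with an Apriori- or FP-growth-style miner.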
