Extracting classification knowledge of Internet documents with mining term associations: a semantic approach

In this paper, we present a system that extracts and generalizes terms from Internet documents to represent the classification knowledge of a given class hierarchy. We propose a measure, called support, that evaluates the importance of a term with respect to a class in the hierarchy. Given a threshold, terms with high support are selected as keywords of a class, and terms with low support are filtered out. To further improve the recall of this approach, association rule mining is applied to discover associations between terms. An inference model is built from these association relations and the previously computed supports of the terms in the class. To increase the recall of the keyword selection process, we then present a polynomial-time inference algorithm that promotes a term strongly associated with a known keyword to a keyword. Experimental results on Internet documents collected from the Yam search engine show that the proposed methods help refine the classification knowledge and increase the recall of keyword selection.
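The two-stage selection described above can be illustrated with a minimal sketch. The abstract does not define the support measure, the association measure, or the inference algorithm, so everything here is an assumption made for illustration: support is taken to be the fraction of a class's documents that contain a term, association strength is taken to be the confidence of the rule "term implies keyword", and promotion is a single pass against the threshold-selected keywords. The names `support`, `confidence`, and `select_keywords` are hypothetical and are not taken from the paper.

```python
def support(term, docs):
    """Assumed measure: fraction of the class's documents containing the term."""
    return sum(term in d for d in docs) / len(docs)


def confidence(term, keyword, docs):
    """Confidence of the rule term -> keyword over the class's documents."""
    with_term = [d for d in docs if term in d]
    if not with_term:
        return 0.0
    return sum(keyword in d for d in with_term) / len(with_term)


def select_keywords(docs, min_support, min_confidence):
    """Stage 1: keep terms whose support meets the threshold.
    Stage 2: promote any remaining term that is strongly associated
    with at least one stage-1 keyword (single pass, an assumption)."""
    vocab = set().union(*docs)
    keywords = {t for t in vocab if support(t, docs) >= min_support}
    promoted = {
        t for t in vocab - keywords
        if any(confidence(t, k, docs) >= min_confidence for k in keywords)
    }
    return keywords | promoted


# Toy class of six documents, each reduced to its set of terms.
example_docs = [
    {"stock", "market", "nasdaq"},
    {"stock", "market", "nasdaq"},
    {"stock", "market"},
    {"stock", "weather"},
    {"market"},
    {"weather"},
]

print(sorted(select_keywords(example_docs, 0.6, 0.8)))
# prints ['market', 'nasdaq', 'stock']
```

In this toy class, "nasdaq" falls below the support threshold (2 of 6 documents) but always co-occurs with the keyword "stock", so it is promoted, while "weather" co-occurs with a keyword only half the time and stays filtered out. Both stages take time polynomial in the vocabulary size and the number of documents, although the paper's actual algorithm may differ.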
