Enhanced hypertext categorization using hyperlinks

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost on a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
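To make the idea of relaxation labeling over a link neighborhood concrete, here is a minimal sketch in Python. It is not the statistical model from the paper, whose details the abstract does not give. It assumes a simple multiplicative update in which each document's class belief is its text-only estimate reweighted by how compatible that class is with the current beliefs of its linked neighbors. The function name `relaxation_labeling`, the `compat` table, and all numbers are hypothetical and chosen only for illustration.

```python
def relaxation_labeling(text_probs, links, known, compat, iterations=10):
    """Iteratively refine class beliefs using linked neighbors (illustrative only).

    text_probs: {doc: {cls: P(cls | text of doc)}} for every document
    links:      {doc: [neighbor docs]} (assumed link neighborhood)
    known:      {doc: cls} for documents whose topic is already known
    compat:     {cls: {neighbor_cls: weight}}, an assumed compatibility table
    """
    classes = list(compat)

    # Known documents are clamped to their label; the rest start from text alone.
    beliefs = {}
    for doc, probs in text_probs.items():
        if doc in known:
            beliefs[doc] = {c: 1.0 if c == known[doc] else 0.0 for c in classes}
        else:
            beliefs[doc] = dict(probs)

    for _ in range(iterations):
        updated = {}
        for doc in beliefs:
            if doc in known:
                updated[doc] = beliefs[doc]
                continue
            scores = {}
            for c in classes:
                score = text_probs[doc][c]
                # Each neighbor contributes its expected compatibility with class c.
                for nb in links.get(doc, ()):
                    score *= sum(compat[c][c2] * beliefs[nb][c2] for c2 in classes)
                scores[c] = score
            total = sum(scores.values())
            if total > 0:
                updated[doc] = {c: s / total for c, s in scores.items()}
            else:
                updated[doc] = beliefs[doc]
        beliefs = updated
    return beliefs


# Toy data: the text of d1 weakly suggests class "B", but both of its
# linked neighbors are known to belong to class "A".
text_probs = {
    "d1": {"A": 0.4, "B": 0.6},
    "d2": {"A": 0.5, "B": 0.5},
    "d3": {"A": 0.5, "B": 0.5},
}
links = {"d1": ["d2", "d3"]}
known = {"d2": "A", "d3": "A"}
compat = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}

beliefs = relaxation_labeling(text_probs, links, known, compat)
best = max(beliefs["d1"], key=beliefs["d1"].get)
print(best, beliefs["d1"])
```

In this toy setup the known labels of the neighbors outweigh the weak text evidence, so the document's most probable class changes from "B" to "A". The same loop also runs when few or no neighbors have known topics: their beliefs then start from their own text estimates and are refined together over the iterations.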
