Predicting content from hyperlinks

document to predict if the user is going to click on the hyperlink. We represent the documents the hyperlinks point to as word-vectors using the bag-of-words representation commonly used in learning on text data (Joachims 1997, Mladenić 1998a, Pazzani and Billsus 1997). This enables us to use machine learning techniques to predict the frequency of words in the vector. We simplify this problem by predicting word occurrence instead of frequency and by reducing the number of words we want to predict. The number of words is reduced by removing words on an English "stop-list" and removing infrequent words, as is commonly done when learning on text data (Cohen 1995, Joachims 1997). The problem of predicting word occurrence is illustrated in Figure 2, which gives a hyperlink and the document it points to, as well as the set of words predicted to occur in the document based on the content of the given hyperlink.

One of the main characteristics of this machine learning problem is having a vector of class attributes instead of a single class attribute, where each vector element corresponds to the occurrence of a particular word in the document. Since a document gives a context for the words occurring in it, the occurrence of one word possibly depends on other words in the document. Thus, we are dealing with the problem of mutually dependent class attributes. Collecting data for this problem is relatively easy, since we are learning from unlabeled data available on the Web.

3 Learning algorithm

We propose a simple modification of the k-Nearest Neighbor algorithm (Aha, Kibler and Albert 1991) for learning on mutually dependent class attributes. We use it in our experiments on predicting word occurrences based on the hyperlink that points to the document. In order to apply the k-Nearest Neighbor algorithm to mutually dependent class attributes, we modified the algorithm to handle examples that consist of HyperLink-Doc pairs, where HyperLink includes the actual hyperlink from the HTML document and some of its context, and Doc is the text document pointed to by the hyperlink. When applied to the Yahoo hierarchy (Yahoo 1997), each hyperlink description HyperLink we build to serve as part of a machine learning example includes all the text in the hyperlink item in the Yahoo document and the words from the document category name (written at the top of the Yahoo document that includes the hyperlink).

The idea is that the algorithm finds a group of neighboring documents that are `close' when the corresponding HyperLinks are similar. HyperLink and Doc are represented by two word-vectors. It is possible that multiple hyperlinks point to the same document. In this case, multiple copies of the document are included and no connection between them is made. A document is considered neighboring if any of the hyperlinks pointing to it is one of the nearest neighbors of the hyperlink under classification. The similarity of two HyperLinks is measured by the cosine similarity between their word-vectors:

$$\cos(\vec{X}, \vec{Y}) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_j X_j^2}\,\sqrt{\sum_l Y_l^2}}$$

Notice that our `neighbor' relation is not transitive. Namely, if two hyperlinks H1 and H2 are neighboring, there can be a third hyperlink H3 that is neighboring to only one of them, e.g., H1. This is due to the fact that two hyperlinks are neighboring because of their common words. The common words of H1 and H2 can be completely different from the common words of H1 and H3, and at the same time the set of common words of H2 and H3 can be empty.
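To make the neighbor-selection step concrete, the following is a minimal Python sketch of our own, not the paper's implementation: word-vectors are represented as sparse word-to-weight mappings, and the names `cosine` and `nearest_hyperlinks` are assumptions introduced here for illustration.

```python
import math
from collections import Counter

def cosine(x, y):
    # Cosine similarity between two sparse word-vectors
    # (mappings from word to weight), as in the formula above.
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def nearest_hyperlinks(query, examples, k):
    # examples: list of (HyperLink vector, Doc) pairs; if several
    # hyperlinks point to the same document, each is kept as a
    # separate example, as described in the text. Returns the k
    # pairs whose HyperLink is most similar to the query hyperlink.
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    return ranked[:k]

# Example: word-vectors built from (already stop-word-filtered) hyperlink text.
h1 = Counter("machine learning text data".split())
h2 = Counter("learning text hypertext".split())
print(cosine(h1, h2))  # -> about 0.577
```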
The detected neighboring documents are then represented by an abstract document whose word occurrences are computed as the average of the word occurrences of the neighboring documents. For each class (word occurrence), a threshold is set for mapping to a binary-valued class, such that the length of the abstract (predicted) document is as close as possible to the average length of the neighboring documents from which it was generated. Figure 1 illustrates the construction of the abstract document. The mutual dependence of the class attributes influences the value of the threshold, which is set such that

$$\mathrm{Length}(Abs.Doc) \doteq \frac{1}{k}\sum_{i=1}^{k} \mathrm{Length}(Ex_i.Doc),$$

where Abs.Doc is the vector representing the abstract document (a code sketch of this construction is given below).

4 Experiments

4.1 Domain description

The procedure for getting domain data from the Yahoo domains given in Table 1 is the following. From all the examples included in the text hierarchy, 500 were randomly selected and used in 5-fold cross-validation. Each example consists of two parts: the hyperlink description and the document the hyperlink points to. The hyperlink part includes an item from the Yahoo document and the corresponding category name this Yahoo document represents. The second part of the example is the actual document pointed to by the hyperlink.
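As referenced above, here is a minimal sketch of the abstract-document construction, under our own assumptions: each neighboring document is reduced to the set of words occurring in it (binary occurrence), document length is the number of distinct words, and a single threshold on the averaged occurrences realizes the per-word mapping. The function name `abstract_document` is hypothetical.

```python
from collections import Counter

def abstract_document(neighbor_docs):
    # neighbor_docs: non-empty list of documents, each given as the
    # set of words occurring in it (binary bag-of-words).
    k = len(neighbor_docs)

    # Average word occurrence over the k neighboring documents.
    avg_occurrence = Counter()
    for doc in neighbor_docs:
        for word in doc:
            avg_occurrence[word] += 1.0 / k

    # Target length: average length of the neighboring documents.
    target_len = sum(len(doc) for doc in neighbor_docs) / k

    # The predicted length only changes at the distinct averaged
    # values, so it suffices to try each of them as the threshold
    # and keep the one whose induced length is closest to target_len.
    best_words, best_gap = set(), float("inf")
    for threshold in sorted(set(avg_occurrence.values())):
        words = {w for w, v in avg_occurrence.items() if v >= threshold}
        gap = abs(len(words) - target_len)
        if gap < best_gap:
            best_words, best_gap = words, gap
    return best_words

# Example: three neighboring documents of lengths 3, 2 and 4; the
# threshold 2/3 yields a predicted document of the target length 3.
docs = [{"text", "learning", "data"},
        {"text", "hypertext"},
        {"text", "learning", "web", "data"}]
print(sorted(abstract_document(docs)))  # -> ['data', 'learning', 'text']
```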