Web document classification using modified decision trees

Searching for Web pages is one of the most common tasks performed on the Web, and Web page classification is the first step in constructing a Web search service. This paper proposes a method for classifying Web documents with a modified decision tree of height three, which splits at the root, at the depth-one nodes, and at the depth-two nodes on keywords, descriptions, and hyperlinks, respectively. Classification starts with a Web page at the root of the decision tree and traces paths downward to the leaves, which give the categories of the page.
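
The traversal described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the split criteria, so the simple term-containment tests below, the names `Page`, `Node`, `Leaf`, and `classify`, and the example categories and URL are all assumptions made for illustration. The sketch also assumes that a page may follow more than one branch at a node, which is one reading of "traces paths downward to the leaves".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple, Union


@dataclass
class Page:
    """Hypothetical page record holding the three features named in the abstract."""
    keywords: Set[str]
    description: str
    hyperlinks: List[str]


@dataclass
class Leaf:
    category: str


@dataclass
class Node:
    # Each branch pairs a predicate on the page with a child subtree.
    branches: List[Tuple[Callable[[Page], bool], Union["Node", Leaf]]] = field(
        default_factory=list
    )


def classify(tree: Union[Node, Leaf], page: Page) -> Set[str]:
    """Follow every branch whose test accepts the page; collect leaf categories."""
    if isinstance(tree, Leaf):
        return {tree.category}
    categories: Set[str] = set()
    for test, child in tree.branches:
        if test(page):
            categories |= classify(child, page)
    return categories


# Assumed split tests, one kind per tree level (illustrative only).
def has_keyword(term: str) -> Callable[[Page], bool]:
    return lambda p: term in p.keywords


def description_mentions(term: str) -> Callable[[Page], bool]:
    return lambda p: term in p.description.lower()


def links_to(fragment: str) -> Callable[[Page], bool]:
    return lambda p: any(fragment in url for url in p.hyperlinks)


# Height-three tree: root splits on keywords, depth-one nodes on the
# description, depth-two nodes on hyperlinks, and leaves carry categories.
root = Node([
    (has_keyword("sports"), Node([
        (description_mentions("football"), Node([
            (links_to("league"), Leaf("sports/football/league")),
            (links_to("shop"), Leaf("sports/football/merchandise")),
        ])),
    ])),
    (has_keyword("news"), Node([
        (description_mentions("football"), Node([
            (links_to("league"), Leaf("news/sports")),
        ])),
    ])),
])

page = Page(
    keywords={"sports", "news"},
    description="Latest football results",
    hyperlinks=["http://example.org/league/table"],
)

print(sorted(classify(root, page)))
```

Because the example page matches both root-level keyword tests, it reaches two leaves and receives two categories. A page that matches no test at some level simply receives no category from that subtree.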