Automatic Web-Page Classification by Using Machine Learning Methods

This paper describes automatic Web-page classification using machine learning methods. Portal site services on the World Wide Web, including search engine functions, have recently grown in importance. In particular, portal sites such as Yahoo!, which classify Web pages hierarchically into many categories, are becoming popular. However, assigning Web pages to categories relies on manual labor, which costs considerable time and care. To alleviate this problem, we propose techniques that generate attributes using co-occurrence analysis and classify Web pages automatically using machine learning. We apply these techniques to Web pages on Yahoo! JAPAN and construct decision trees that determine the appropriate category for each Web page. The performance of the proposed method is evaluated in terms of error rate, recall, and precision. The experimental evaluation demonstrates that the method classifies Web pages into the top-level categories of Yahoo! JAPAN with acceptable accuracy.
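
The three evaluation measures named above can be illustrated with a minimal Python sketch. It assumes the usual per-category definitions of recall and precision and an overall error rate; the abstract does not spell these definitions out, so they are assumptions here. The function name `evaluate`, the category labels, and the sample data are hypothetical and are not results from the experiments.

```python
def evaluate(true_labels, predicted_labels, category):
    """Return (error_rate, recall, precision) for one category.

    Assumed definitions (not taken from the paper):
      error rate = misclassified pages / all pages, over every category
      recall     = pages correctly assigned to `category` / pages truly in it
      precision  = pages correctly assigned to `category` / pages assigned to it
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    pairs = list(zip(true_labels, predicted_labels))
    errors = sum(1 for t, p in pairs if t != p)
    correct_in_category = sum(1 for t, p in pairs if t == p == category)
    truly_in_category = sum(1 for t, _ in pairs if t == category)
    assigned_to_category = sum(1 for _, p in pairs if p == category)

    error_rate = errors / len(pairs)
    # Report 0.0 when a ratio is undefined (no pages in or assigned to the category).
    recall = correct_in_category / truly_in_category if truly_in_category else 0.0
    precision = (
        correct_in_category / assigned_to_category if assigned_to_category else 0.0
    )
    return error_rate, recall, precision


# Hypothetical labels for eight pages; these are illustrative, not experimental data.
true_labels = ["Arts", "Arts", "Science", "Science",
               "Science", "Science", "Health", "Health"]
predicted_labels = ["Arts", "Science", "Science", "Science",
                    "Science", "Health", "Health", "Science"]

error_rate, recall, precision = evaluate(true_labels, predicted_labels, "Science")
print(f"error rate={error_rate:.3f} recall={recall:.3f} precision={precision:.3f}")
```

In this sketch the error rate is computed over all pages, while recall and precision are computed for a single category, so a category can have high recall and low precision when the classifier assigns too many pages to it.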