Paper Information - Automated Blog Classification: Challenges and Pitfalls

Blogs are difficult for humans and machines alike to categorize, because they are written in a capricious style. In the early days of the web, directories maintained by humans could not keep up with the millions of websites; likewise, blog directories cannot keep up with the explosive growth of the blogosphere. This paper investigates the efficacy of using machine learning to categorize blogs. We design a text classification experiment to categorize 120 blogs into four topics: personal diary, news, political, and sports. The baseline feature set is unigrams weighted by TF-IDF, which yielded 84% accuracy. We analyze the corpus, features, and result data. Our analysis leads us to believe that blog taxonomies need to support polyhierarchy: a given blog may be correctly classified under more than one category.
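To make the baseline feature concrete, the following is a minimal sketch of TF-IDF weighting over unigrams. The abstract does not specify the tokenizer, the exact TF-IDF variant, or the classifier, so the whitespace tokenization, the `tf * log(N / df)` formula, the function name `tfidf_unigrams`, and the three sample texts are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_unigrams(docs):
    """Return one {unigram: weight} dict per document (illustrative variant)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each unigram.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = relative term frequency * log(N / document frequency).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical miniature corpus, one text per topic-like category.
blogs = [
    "the senate vote on the election bill",
    "the team won the final game",
    "today i walked the dog and felt happy",
]
vectors = tfidf_unigrams(blogs)
```

Under this variant, a word that occurs in every document (here "the") receives weight zero, while topic-specific words such as "senate" keep a positive weight. These weighted vectors would then be passed to a classifier of choice.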

Hong Qu | Sarah S. Poon | Andrea La Pietra