Efficient Cross-Domain Classification of Weblogs

Text classification is one of the core applications in data mining due to the huge amount of uncategorized textual data available. Training a text classifier results in a classification model that reflects the characteristics of the domain it was learned on. However, if no training data is available for the target domain, labeled data from a related but different domain can be exploited to perform cross-domain classification. In our work, we aim to accurately classify unlabeled weblogs into commonly agreed-upon newspaper categories using labeled data from the news domain. Both the labeled news corpus and the unlabeled blog corpus are highly dynamic: they grow hourly and are subject to topic drift, so the classification needs to be efficient. Our approach is to apply a fast, novel centroid-based text classification algorithm, the Class-Feature-Centroid Classifier (CFC), to perform efficient cross-domain classification. Experiments showed that this algorithm achieves accuracy comparable to that of k-Nearest Neighbour (k-NN) and Support Vector Machines (SVM), yet at linear time cost for training and classification. We investigate the classifier performance and generalization ability using a special visualization of classifiers. The benefit of our approach is that the linear time complexity enables us to efficiently generate an accurate classifier, reflecting the topic drift, several times per day on a huge dataset.
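
To make the idea of centroid-based cross-domain classification concrete, the following is a minimal sketch of a generic centroid classifier, not the CFC algorithm itself: it uses raw term counts where CFC uses its own class-feature weighting, which the abstract does not specify. The function names `train_centroids` and `classify` and the toy news and blog data are assumptions made for illustration only. Training is a single pass over the labeled documents and classification is one dot product per class, which is where the linear time cost of this family of methods comes from.

```python
import math
from collections import Counter, defaultdict


def train_centroids(labeled_docs):
    """Build one unit-length centroid per class in a single pass.

    labeled_docs: iterable of (tokens, label) pairs from the source domain.
    Assumption: plain term counts stand in for CFC's term weighting.
    """
    sums = defaultdict(Counter)
    for tokens, label in labeled_docs:
        sums[label].update(tokens)  # accumulate term counts per class

    centroids = {}
    for label, counts in sums.items():
        norm = math.sqrt(sum(v * v for v in counts.values()))
        centroids[label] = {t: v / norm for t, v in counts.items()}
    return centroids


def classify(tokens, centroids):
    """Return the label whose centroid has the largest dot product with the document.

    The document norm is the same for every class, so it is omitted;
    the resulting ranking matches cosine similarity.
    """
    counts = Counter(tokens)

    def score(centroid):
        return sum(c * centroid.get(t, 0.0) for t, c in counts.items())

    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical labeled news documents (source domain).
news = [
    ("election parliament vote minister".split(), "politics"),
    ("vote coalition minister budget".split(), "politics"),
    ("match goal league striker".split(), "sports"),
    ("league coach goal transfer".split(), "sports"),
]
centroids = train_centroids(news)

# Hypothetical unlabeled blog post (target domain).
blog_post = "my thoughts on the league and that late goal".split()
predicted = classify(blog_post, centroids)
print(predicted)
```

Because retraining only requires re-accumulating the per-class sums, a classifier of this kind can be rebuilt frequently as new labeled news arrives, which is the property the abstract relies on to follow topic drift.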
