Topic classification in Romanian blogosphere

In this paper we analyze the performance of several classification methods applied to the Romanian blogosphere. Blogs are difficult for humans and machines alike to categorize, because they are written in a changeable style. In the early days of the web, directories maintained by humans could not keep up with the millions of websites; likewise, blog directories cannot keep up with the explosive growth of the blogosphere. This paper investigates the efficacy of using machine learning to categorize blogs written in Romanian. We design a text classification experiment to categorize Romanian blogs into nine topics. The baseline feature is unigrams weighted by TF-IDF. We analyze the corpus, the features, and the resulting data.
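To make the baseline feature concrete, the following is a minimal sketch of unigram TF-IDF weighting in plain Python. The abstract does not state which TF-IDF variant is used, so the formula here (relative term frequency multiplied by log(N/df)) is an assumption, as are the function names `tokenize` and `tfidf_vectors` and the three sample posts, which are invented for illustration and are not taken from the corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and keep purely alphabetic tokens as unigrams.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(documents):
    # Assumed variant: tf = count / document length, idf = log(N / document frequency).
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each unigram occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample posts (diacritics omitted), for illustration only.
posts = [
    "reteta de supa cu legume",
    "meci de fotbal cu goluri",
    "reteta de prajitura cu ciocolata",
]
vectors = tfidf_vectors(posts)
```

Under this weighting, words that occur in every post (here "de" and "cu") receive a weight of zero, while words specific to one post receive the highest weights. The resulting sparse vectors could then be passed to any of the classifiers being compared.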
