Feature Selection and Data Sampling Methods for Learning Reputation Dimensions

We report on our participation in the reputation dimension task of the CLEF RepLab 2014 evaluation initiative, which requires classifying social media updates into eight predefined categories. We address the task with corpus-based methods that extract textual features from the labeled training data, and use these features to train two supervised classifiers. We explore three sampling strategies for selecting training examples and examine their effect on classification performance. All of our submitted runs outperform the baseline, and elaborate feature selection methods coupled with balanced datasets help improve classification accuracy.
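The abstract does not spell out the three sampling strategies, so the following is only a minimal sketch of one common way to obtain a balanced training set: downsampling every category to the size of the smallest one. The function name `balance_by_downsampling`, the `(text, label)` input format, and the toy labels are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed=0):
    """Return a class-balanced subset of (text, label) pairs.

    Hypothetical helper: each label is randomly downsampled to the size
    of the rarest label, so every category contributes equally.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    # Group the examples by their category label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    if not by_label:
        return []

    # The rarest category determines how many examples each class keeps.
    target = min(len(items) for items in by_label.values())

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))

    rng.shuffle(balanced)  # avoid label-ordered training data
    return balanced


if __name__ == "__main__":
    # Toy data with placeholder labels, not taken from the RepLab corpus.
    toy = [("post %d" % i, "dim_a") for i in range(5)]
    toy += [("post %d" % i, "dim_b") for i in range(5, 7)]
    for text, label in balance_by_downsampling(toy):
        print(label, text)
```

Downsampling discards data from the larger categories; oversampling the smaller ones is an alternative with the opposite trade-off, and which of the two (if either) corresponds to the strategies evaluated here is not stated in the abstract.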
