Feature selection and data sampling methods for learning reputation dimensions: The University of Amsterdam at RepLab 2014

We report on our participation in the reputation dimension task of the CLEF RepLab 2014 evaluation initiative, which asks participants to classify social media updates into eight predefined categories. We address the task by using corpus-based methods to extract textual features from the labeled training data, and we use these features to train two classifiers in a supervised way. We explore three sampling strategies for selecting training examples and probe their effect on classification performance. We find that all our submitted runs outperform the baseline, and that elaborate feature selection methods coupled with balanced datasets help improve classification accuracy.
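To make the idea of a balanced dataset concrete, the following is a minimal sketch of one possible sampling strategy: downsampling every class to the size of the rarest one. The abstract does not specify which three strategies were used, so this is an illustrative assumption rather than the method of the paper; the function name `downsample_balanced` and the toy labels `"A"` and `"B"` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_balanced(examples, seed=0):
    """Return a class-balanced subset of (text, label) pairs.

    Hypothetical helper: every label is reduced to the size of the
    rarest label by random sampling without replacement.
    """
    if not examples:
        return []

    # Group the examples by their label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # The rarest class determines how many examples each class keeps.
    n_per_class = min(len(items) for items in by_label.values())

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_per_class))

    # Shuffle so that the classes are interleaved in the training set.
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy, made-up data: five updates of class "A", two of class "B".
    toy = [("update a%d" % i, "A") for i in range(5)]
    toy += [("update b%d" % i, "B") for i in range(2)]
    sample = downsample_balanced(toy, seed=1)
    print(Counter(label for _, label in sample))
```

Downsampling discards training data from the frequent classes, so in practice it would be compared against alternatives such as oversampling the rare classes or keeping the original distribution.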
