Paper Information - Feature Selection and Data Sampling Methods for Learning Reputation Dimensions

Feature Selection and Data Sampling Methods for Learning Reputation Dimensions

We report on our participation in the reputation dimension task of the CLEF RepLab 2014 evaluation initiative, in which social media updates are classified into eight predefined categories. We address the task with corpus-based methods that extract textual features from the labeled training data, and we use these features to train two classifiers in a supervised way. We explore three sampling strategies for selecting training examples and examine their effect on classification performance. We find that all our submitted runs outperform the baseline, and that elaborate feature selection methods coupled with balanced datasets help improve classification accuracy.
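The abstract does not spell out the three sampling strategies. As one illustration of how a balanced training set might be built, the sketch below downsamples every class to the size of the rarest one. The function name `balance_by_downsampling`, the placeholder labels, and the choice of downsampling are all assumptions made for illustration; this is not the authors' method.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(examples, seed=0):
    """Return a class-balanced subset of (text, label) pairs.

    Hypothetical helper: every class is cut down to the size of the
    rarest class, so all labels end up equally represented.
    """
    # Group the examples by their label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    if not by_label:
        return []

    # The rarest class determines how many examples each class keeps.
    smallest = min(len(group) for group in by_label.values())

    # A seeded generator keeps the sample reproducible.
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy, skewed data set; the labels are placeholders, not RepLab categories.
toy = (
    [("update %d" % i, "dim_a") for i in range(6)]
    + [("update %d" % i, "dim_b") for i in range(6, 9)]
    + [("update %d" % i, "dim_c") for i in range(9, 11)]
)
balanced = balance_by_downsampling(toy, seed=1)
print(Counter(label for _, label in balanced))
```

Downsampling discards data from the frequent classes. Whether that trade-off pays off depends on the class distribution and on the classifier, which is presumably why several sampling strategies were compared.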

Maarten de Rijke | Cristina Garbacea | Manos Tsagkias
