A Big Data approach to gender classification in Twitter: Notebook for PAN at CLEF 2018
This paper describes a statistical approach to gender classification of tweets, designed with a Big Data perspective in mind. We began by developing our own implementation of the Low Dimensionality Representation method, intending to add statistics that the original implementation did not use, such as skewness, kurtosis and central moments. Exploratory analysis of these new features showed the importance of skewness, because the problem has only two classes. Our approach therefore uses skewness alone, both to describe the differences in language use between men and women and to predict gender on the test dataset.
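As an illustration of the statistics named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a simplified term weighting in which each term's weight is the share of its occurrences found in documents of one class, and it uses population (biased) forms of the moments. The function names `central_moment`, `skewness`, `kurtosis`, `term_weights` and `skew_feature`, as well as the whitespace tokenization, are hypothetical choices made for this example.

```python
from collections import Counter


def central_moment(values, k):
    """k-th central moment (population form) of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Population skewness m3 / m2**1.5; returns 0.0 for constant input."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0
    return central_moment(values, 3) / m2 ** 1.5


def kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3; returns 0.0 for constant input."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0
    return central_moment(values, 4) / m2 ** 2 - 3.0


def term_weights(docs, labels, target):
    """Assumed weighting: share of each term's occurrences in `target`-labelled docs."""
    total, in_class = Counter(), Counter()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        total.update(tokens)
        if label == target:
            in_class.update(tokens)
    return {term: in_class[term] / total[term] for term in total}


def skew_feature(text, weights):
    """Skewness of the weights of the known terms in one document."""
    vals = [weights[t] for t in text.lower().split() if t in weights]
    return skewness(vals) if vals else 0.0
```

With weights computed for one class, `skew_feature` yields a single number per document; in a two-class setting, such a value could be thresholded or passed to a classifier, though the decision rule the paper actually uses is not stated in this abstract.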
[1] Paolo Rosso, et al. A Low Dimensionality Representation for Language Variety Identification, 2016, CICLing.
[2] Benno Stein, et al. Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter, 2018, CLEF.