Tags and titles of videos you watched tell your gender

In online video systems, viewer demographic information (gender, age, etc.) has great commercial value for targeted advertising and video recommendation, but it is generally not directly available. This paper infers viewers' gender from implicit watching history in large-scale online video systems. To tackle the sparsity problem without filtering out any cold users or videos, we introduce video tags as features and also use an efficient Chinese word segmentation method to extract hot keywords from video titles as features. Moreover, because users' viewing behavior is lognormally distributed, we apply a logarithmic transformation to the inference matrices and then identify key features via principal component analysis (PCA). We then treat gender inference as a classification problem and define modified evaluation metrics adapted to imbalanced classification. We compare a set of classifiers, including class prior, EM, SVM, logistic regression, partially supervised soft-label, and belief-based mixture, and find that logistic regression performs best. The inference results show that our algorithms obtain high F̃1 values for all classes; the highest value on the PPTV dataset is nearly 0.75. Inference based on keywords yields a 14.63% increase in F̃1 compared with the ratings of MovieLens.
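Two of the steps mentioned above, the logarithmic transformation of the viewing matrix and an evaluation metric that is not dominated by the majority class, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, `log(1 + x)` is assumed so that zero counts stay at zero, and plain macro-averaged F1 is used as a stand-in, since the exact definition of the modified F̃1 metric is not given here.

```python
import math

def log_transform(counts):
    """Apply log(1 + x) to every entry of a user-by-feature count matrix.

    Assumption: log1p is used so that zero counts map to zero; the
    abstract only says a logarithmic transformation is applied.
    """
    return [[math.log1p(x) for x in row] for row in counts]

def per_class_f1(y_true, y_pred, label):
    """Standard F1 score for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (a stand-in, not the paper's metric)."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical toy data: 8 male and 2 female viewers, and a classifier
# that always predicts the majority class.
y_true = ["M"] * 8 + ["F"] * 2
y_pred = ["M"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

On this toy data the always-majority classifier reaches an accuracy of 0.8, while its macro-averaged F1 is much lower because the minority class scores zero. This is the kind of gap that motivates per-class metrics for imbalanced problems.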
