Accurate inference of user popularity preference in a large-scale online video streaming system

With the fast growth of online video services, service providers strive to satisfy users’ personal preferences. Most have recognized the diversity of users’ preferences for video content, but not for video popularity. Only Goel et al. [1] showed, in other domains, that users differ in their popularity preferences (PPs), and Oh et al. [2] used statistics of users’ PPs to improve recommendation performance. However, the statistical method of obtaining users’ PPs is biased when the available historical records are as limited as they are in an online video recommendation system. In this article, we characterize users’ PPs in a large-scale online video streaming system from China and propose two collaborative filtering (CF) [3] based algorithms to infer users’ PPs. Compared with the statistical method, our proposed algorithms substantially improve the accuracy of the inferred PPs, and the improvement grows as the training data become scarcer. Our work is beneficial for providing better personalized services.

Dataset. We base our study on a large-scale dataset from the client of PPTV, a typical online video streaming system and one of the largest in China. In the dataset, we filter out sessions shorter than 30 s, in which users may not have been watching purposefully out of interest, and filter out users with fewer than 20 records to ensure that we have enough data to evaluate the accuracy of our inference algorithms. The resulting dataset, collected from March 23 to 28, 2011, includes more than 20 thousand movie videos, 90 thousand users, and more than 2 million sessions.

Characterization. We assign each user a PP sequence whose elements are the ordered popularity rankings of the videos that the user has watched. We characterize an individual user’s PP sequence with respect to three statistics: central tendency (measured by the median), dispersion (by the coefficient of variation (CV)), and skewness (by a normalized metric defined as (Mean−Median)/Standard Deviation).
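As an illustration of the three statistics defined for a PP sequence, the following minimal sketch computes them for a single user. The function name `pp_characteristics` and the sample rankings are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, median, pstdev


def pp_characteristics(ranks):
    """Return (median, CV, skewness) for one user's PP sequence.

    `ranks` holds the popularity rankings (positive numbers, 1 = most
    popular) of the videos the user has watched.
    Assumption: the population standard deviation is used; the source
    does not say which estimator applies.
    """
    if not ranks:
        raise ValueError("PP sequence must not be empty")

    mu = mean(ranks)
    med = median(ranks)
    sd = pstdev(ranks)

    # Coefficient of variation: standard deviation relative to the mean.
    cv = sd / mu

    # Normalized skewness as defined in the text:
    # (Mean - Median) / Standard Deviation.
    # A constant sequence has no spread, so report zero skewness.
    skew = (mu - med) / sd if sd > 0 else 0.0

    return med, cv, skew


# Hypothetical user who mostly watches popular videos plus one niche title.
example_ranks = [1, 2, 3, 10]
med, cv, skew = pp_characteristics(example_ranks)
print(f"median={med}, CV={cv:.3f}, skewness={skew:.3f}")
```

In this sketch a positive skewness means that the mean ranking lies above the median, that is, a few unpopular (high-ranking-number) videos pull the mean upward.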
These three characteristics complement each other. Any single one, such as the central tendency alone as examined in [1], would not be enough to characterize users’ PPs. To examine whether users’ PPs are homogeneous, we compare the distributions of the three PP characteristics in the real dataset with those in a null model, which assumes that all users homogeneously select videos with probability proportional to each video’s popularity. Our observations are as follows. (i) On average, most real users in PPTV prefer popular videos, but not as strongly as the null model assumes, as shown in Figure 1(a). This gap differs across systems. For example, the majority of users in Netflix (a movie rental system), as shown in Figure 5(a) of [1], on average prefer more popu-

[1] Andrei Z. Broder, et al. Anatomy of the long tail: ordinary people with extraordinary tastes, 2010, WSDM '10.

[2] Sun Park, et al. Novel Recommendation Based on Personal Popularity Tendency, 2011, 2011 IEEE 11th International Conference on Data Mining.

[3] Fillia Makedon, et al. Using singular value decomposition approximation for collaborative filtering, 2005, Seventh IEEE International Conference on E-Commerce Technology (CEC'05).

[4] Taghi M. Khoshgoftaar, et al. A Survey of Collaborative Filtering Techniques, 2009, Adv. Artif. Intell.

[5] Fillia Makedon, et al. Learning from Incomplete Ratings Using Non-negative Matrix Factorization, 2006, SDM.