Predicting Longitudinal User Activity at Fine Time Granularity in Online Collaborative Platforms

This paper introduces a decomposition approach to the problem of predicting different types of user activity at hourly granularity over a long period of time. Our approach involves two steps. First, we use a temporal neural network ensemble to predict the number of events of each activity type that occur in a day. Second, we use a set of neural networks to assign each event to a user-repository pair in a particular hour. We focus this work on a subset of the public GitHub dataset that records the activities of over 2 million users on over 400,000 software repositories. Our experiments show that we can predict hourly user-repository activity with reasonably low error. Our simulations are accurate for 1–3 weeks (168–504 hours) after inception, with accuracy gradually falling off as the horizon lengthens. We also show that including activity on Twitter and Reddit increases the accuracy of activity prediction on GitHub for most event types.
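The two-step decomposition can be illustrated with a minimal sketch. This is not the neural-network pipeline described above: the daily counts are taken as given (standing in for the output of the first step), and the second step is replaced by weighted random sampling over hypothetical per-pair and per-hour weights. The function name `assign_events`, its parameters, and the assumption that the pair and the hour are drawn independently are all illustrative choices rather than details from the paper.

```python
import random
from collections import Counter


def assign_events(daily_counts, pair_weights, hour_weights, seed=0):
    """Spread predicted daily event counts over (user, repo, hour) slots.

    daily_counts: {event_type: int}, a stand-in for the step-1 daily totals.
    pair_weights: {event_type: {(user, repo): float}}, a stand-in for the
        step-2 scores of each user-repository pair (hypothetical).
    hour_weights: 24 non-negative floats giving the relative activity per hour.
    Returns a Counter keyed by (event_type, user, repo, hour).
    """
    if len(hour_weights) != 24:
        raise ValueError("hour_weights must contain 24 values")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hours = list(range(24))
    assigned = Counter()
    for event_type, count in daily_counts.items():
        pairs = list(pair_weights[event_type])
        weights = [pair_weights[event_type][p] for p in pairs]
        # Assumption: the pair and the hour are sampled independently.
        chosen_pairs = rng.choices(pairs, weights=weights, k=count)
        chosen_hours = rng.choices(hours, weights=hour_weights, k=count)
        for (user, repo), hour in zip(chosen_pairs, chosen_hours):
            assigned[(event_type, user, repo, hour)] += 1
    return assigned


if __name__ == "__main__":
    daily = {"PushEvent": 5, "IssuesEvent": 2}
    pairs = {
        "PushEvent": {("alice", "repo-a"): 3.0, ("bob", "repo-b"): 1.0},
        "IssuesEvent": {("alice", "repo-b"): 1.0},
    }
    hourly = [1.0] * 24  # uniform activity across the day
    for key, n in sorted(assign_events(daily, pairs, hourly).items()):
        print(key, n)
```

By construction, the hourly assignments for each event type sum to that type's daily total, which is the property that lets the two steps be modelled and evaluated separately.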
