On Skewed Multi-dimensional Distributions: the FusionRP Model, Algorithms, and Discoveries

How do we model and find outliers in Twitter data? Given the number of retweets of each person on a social network, what is their expected number of comments? Real-life data are often very skewed, exhibiting power-law-like behavior. For such skewed multi-dimensional discrete data, existing models are not general enough to capture various realistic scenarios and, because they typically model continuous quantities, often need to be discretized. We propose FusionRP, short for Fusion Restaurant Process, a simple and intuitive model for skewed multi-dimensional discrete distributions, such as the number of retweets vs. comments in Twitter-like data. Our model is discrete by design, has a sum of marginals that is provably asymptotically log-logistic, is general enough to capture varied relationships, and, most importantly, fits real data very well. We give an effective and scalable maximum-likelihood-based fitting approach that is linear in the number of unique observed values and in the input dimension. We test FusionRP on a Twitter-like social network with 2.2M users, a phone call network with 1.9M call records, game data with 45M users, and Facebook data with 2.5M posts. Our results show that FusionRP significantly outperforms several alternative methods and can detect outliers, such as bot-like behaviors in the Facebook data.
