Semantically-Aware Statistical Metrics via Weighting Kernels

Distance metrics between statistical distributions are widely used as an efficient means to aggregate and simplify the underlying probabilities, thus enabling high-level analyses. In this paper we investigate the collisions that can arise with such metrics, and a mitigation technique rooted in kernels. In detail, we first show that colliding functions (so-called iso-curves) are widespread across metrics and families of functions (e.g., Gaussian, heavy-tailed). We then propose a kernel-based solution for augmenting distance metrics and summary statistics, thus avoiding collisions and highlighting semantically relevant phenomena. This study is supported by a thorough theoretical evaluation of our solution against a large number of functions and metrics, complemented by a real-world evaluation carried out by applying our solution to an existing problem. Some further research avenues are also discussed. The theoretical construction and the achieved results show the soundness, viability, and quality of our proposal, which, besides being interesting on its own, also paves the way for further research in the highlighted directions.
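
The collision phenomenon and the idea of a weighting kernel can be illustrated with a minimal sketch. This is not the construction used in the paper: the choice of total variation distance, the five-bin toy distributions, the Gaussian-shaped weights, and the names tv_distance, weighted_tv_distance, and gaussian_weights are all assumptions made for illustration only. Two different distributions lie at the same distance from a reference, and a weighting that emphasizes one region of the support separates them.

```python
import math

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def gaussian_weights(n, center, sigma):
    """Hypothetical weighting kernel: Gaussian-shaped weights over bin indices."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]

def weighted_tv_distance(p, q, w):
    """Total variation distance with each bin's contribution scaled by w."""
    return 0.5 * sum(wi * abs(a - b) for wi, a, b in zip(w, p, q))

# Reference distribution and two different perturbations of it (toy data).
p = [0.2, 0.2, 0.2, 0.2, 0.2]
q1 = [0.3, 0.2, 0.2, 0.2, 0.1]  # mass moved between the first and last bins
q2 = [0.2, 0.3, 0.1, 0.2, 0.2]  # mass moved between the second and third bins

# Unweighted: q1 and q2 collide, both are equally far from p.
d1, d2 = tv_distance(p, q1), tv_distance(p, q2)

# Weighted: emphasizing the first bin (assumed to be the semantically
# relevant region) breaks the tie.
w = gaussian_weights(len(p), center=0, sigma=1.0)
wd1, wd2 = weighted_tv_distance(p, q1, w), weighted_tv_distance(p, q2, w)

print("unweighted:", d1, d2)
print("weighted:  ", wd1, wd2)
```

In this toy setting the weights encode which part of the support matters for the analysis. A different center or width would rank the two distributions differently, which is the sense in which such a weighting makes the metric semantically aware.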
