Machine Learning and Knowledge Discovery in Databases

Abstracts of Journal Track Articles

A Bayesian Approach for Comparing Cross-Validated Algorithms on Multiple Data Sets
Giorgio Corani and Alessio Benavoli
Machine Learning, DOI: 10.1007/s10994-015-5486-z

We present a Bayesian approach for making statistical inference about the accuracy (or any other score) of two competing algorithms which have been assessed via cross-validation on multiple data sets. The approach consists of two pieces. The first is a novel correlated Bayesian t-test for the analysis of the cross-validation results on a single data set, which accounts for the correlation due to the overlapping training sets. The second piece merges the posterior probabilities computed by the Bayesian correlated t-test on the different data sets to make inference on multiple data sets. It does so by adopting a Poisson-binomial model. The inferences on multiple data sets account for the different uncertainty of the cross-validation results on the different data sets. It is the first test able to achieve this goal. It is generally more powerful than the signed-rank test if ten runs of cross-validation are performed, as is generally recommended anyway.

A Decomposition of the Outlier Detection Problem into a Set of Supervised Learning Problems
Heiko Paulheim and Robert Meusel
Machine Learning, DOI: 10.1007/s10994-015-5507-y

Outlier detection methods automatically identify instances that deviate from the majority of the data. In this paper, we propose a novel approach for unsupervised outlier detection, which reformulates the outlier detection problem in numerical data as a set of supervised regression learning problems. For each attribute, we learn a predictive model which predicts the values of that attribute from the values of all other attributes, and compute the deviations between the predictions and the actual values. From those deviations, we derive both a weight for each attribute and a final outlier score using those weights.
The weights help separate the relevant attributes from the irrelevant ones, and thus make the approach well suited for discovering outliers otherwise masked in high-dimensional data. An empirical evaluation shows that our approach outperforms existing algorithms, and is particularly robust in datasets with many irrelevant attributes. Furthermore, we show that if a symbolic machine learning method is used to solve the individual learning problems, the approach is also capable of generating concise explanations for the detected outliers.

Assessing the Impact of a Health Intervention via User-Generated Internet Content
Vasileios Lampos, Elad Yom-Tov, Richard Pebody, and Ingemar J. Cox
Data Mining and Knowledge Discovery, DOI: 10.1007/s10618-015-0427-9

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter.
The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.

Beyond Rankings: Comparing Directed Acyclic Graphs
Eric Malmi, Nikolaj Tatti, and Aristides Gionis
Data Mining and Knowledge Discovery, DOI: 10.1007/s10618-015-0406-1

Defining appropriate distance measures among rankings is a classic area of study which has led to many useful applications. In this paper, we propose a more general abstraction of preference data, namely directed acyclic graphs (DAGs), and introduce a measure for comparing DAGs, given that a vertex correspondence between the DAGs is known. We study the properties of this measure and use it to aggregate and cluster a set of DAGs. We show that these problems are NP-hard and present efficient methods to obtain solutions with approximation guarantees. In addition to preference data, these methods turn out to have other interesting applications, such as the analysis of a collection of information cascades in a network. We test the methods on synthetic and real-world datasets, showing that the methods can be used to, e.g., find a set of influential individuals related to a set of topics in a network or to discover meaningful and occasionally surprising clustering structure.
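
As an illustration of the Poisson-binomial model mentioned in the abstract by Corani and Benavoli above, the following minimal sketch computes the distribution of the number of data sets on which one algorithm beats the other, assuming each data set contributes an independent Bernoulli outcome with its own posterior probability. The function names, the majority criterion, and the example probabilities are hypothetical and chosen for illustration only; the article's actual inference procedure may differ.

```python
from typing import List


def poisson_binomial_pmf(probs: List[float]) -> List[float]:
    """Return P(K = k) for k = 0..n, where K counts the successes of
    independent Bernoulli trials with success probabilities `probs`."""
    pmf = [1.0]  # with zero trials, K = 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this trial fails
            nxt[k + 1] += mass * p      # this trial succeeds
        pmf = nxt
    return pmf


def prob_majority(probs: List[float]) -> float:
    """Return P(K > n/2): a strict majority of the trials succeed."""
    pmf = poisson_binomial_pmf(probs)
    n = len(probs)
    return sum(mass for k, mass in enumerate(pmf) if 2 * k > n)


if __name__ == "__main__":
    # Hypothetical per-data-set posterior probabilities that
    # algorithm A is better than algorithm B (not from the article).
    posteriors = [0.9, 0.6, 0.75, 0.55, 0.8]
    print(poisson_binomial_pmf(posteriors))
    print(prob_majority(posteriors))
```

Because each data set keeps its own probability, a data set with an uncertain cross-validation result (a posterior close to 0.5) moves the combined distribution less than one with a decisive result.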
