Data Science of the People, for the People, by the People: A Viewpoint on an Emerging Dichotomy

This paper presents a viewpoint on an emerging dichotomy in data science: applications in which the predictions of data-driven algorithms support people in making consequential decisions that can profoundly affect other people’s lives, and applications in which data-driven algorithms act autonomously in settings of low consequence and large scale. An example of the first type is prison sentencing; an example of the second is selecting the news stories that appear on a person’s web portal home page. It is argued that the two types of applications require data, algorithms, and models with vastly different properties along several dimensions, including privacy, equitability, robustness, interpretability, causality, and openness. Furthermore, it is argued that the second type of application cannot always serve as a surrogate for developing methods for the first type. To contribute to the development of methods for the first type of application, one must actually work on the first type of application.
