Finding patterns in features and observations: new machine learning models with applications in computational criminology, marketing, and medicine

The "Big Data" revolution has reached fields such as marketing, healthcare, and criminology, where domain experts wish to find and understand interesting patterns in data. This thesis studies patterns that are defined by subsets of observations or subsets of features. The first part of the thesis studies patterns defined by subsets of observations. We look at a specific type of pattern, the crime series (a set of crimes committed by the same individual or group), and develop two pattern detection algorithms. The first method is a sequential pattern-building algorithm called Series Finder, which resembles how crime analysts instinctively process information and grows a crime series starting from a couple of seed crimes. The second method is a subspace clustering algorithm with cluster-specific feature selection, which is supervised when learning similarity graphs in order to reduce computation. Both methods achieved promising results on a decade's worth of crime pattern data collected by the Crime Analysis Unit of the Cambridge Police Department. The second part of the thesis studies patterns defined by subsets of features. We develop methods and theory for building Rule Set models, whose hallmark is interpretability. Interpretability is inherent in using association rules to explain predicted results. We first design two methods for building rule sets for binary classification. The first method, Bayesian Rule Set (BRS), uses a Bayesian framework with priors that favor small models. The Bayesian priors also bring significant computational benefits to MAP inference by reducing the search space and restraining the sampling chain within appropriate regions. We apply BRS models to an in-vehicle recommender system data set that we collected via Amazon Mechanical Turk to study the customers and contexts that would encourage acceptance of coupons.
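To make the notion of a rule set concrete, the following is a minimal sketch, assuming the usual reading of a rule set for binary classification: each rule is a conjunction of conditions, and an observation is predicted positive if at least one rule covers it. The attribute names, values, and the helpers `rule_covers` and `rule_set_predict` are hypothetical illustrations, not the thesis's implementation, and the sketch covers only prediction, not how BRS learns the rules.

```python
from typing import Dict, List

# Hypothetical encoding: a rule is a conjunction of "attribute == value" conditions.
Rule = Dict[str, str]


def rule_covers(rule: Rule, x: Dict[str, str]) -> bool:
    """Return True if observation x satisfies every condition in the rule."""
    return all(x.get(attr) == value for attr, value in rule.items())


def rule_set_predict(rules: List[Rule], x: Dict[str, str]) -> int:
    """Predict 1 if at least one rule covers x (an OR of ANDs), else 0."""
    return int(any(rule_covers(rule, x) for rule in rules))


# Invented toy rules for a coupon-acceptance setting (not learned from data).
rules = [
    {"destination": "no urgent place", "passenger": "friends"},
    {"weather": "sunny", "time": "10AM"},
]

accepting = {"destination": "no urgent place", "passenger": "friends", "weather": "rainy"}
declining = {"destination": "work", "passenger": "alone", "weather": "sunny", "time": "2PM"}

print(rule_set_predict(rules, accepting))
print(rule_set_predict(rules, declining))
```

In this form, a prediction can be explained by pointing at the specific rule that fired, and a prior that favors small models corresponds to preferring fewer and shorter rules.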
We develop a second model, Optimized Rule Set (ORS), which uses optimization methods to construct rule sets directly from data, without pre-mining rules or discretizing continuous attributes. As a main application of ORS, we build a diagnostic screening tool for obstructive sleep apnea, trained on data provided by the Sleep Lab at Mass General Hospital. Our models achieve high accuracy with a substantial gain in interpretability over other

[44]  Christopher M. Gifford,et al.  Fuzzy association rule mining for community crime pattern discovery , 2010, ISI-KDD '10.

[45]  Stefan Rüping,et al.  Learning interpretable models , 2006 .

[46]  Yimin Liu,et al.  Or's of And's for Interpretable Classification, with Application to Context-Aware Recommender Systems , 2015, ArXiv.

[47]  Cynthia Rudin,et al.  Falling Rule Lists , 2014, AISTATS.

[48]  George E. Tita,et al.  Self-Exciting Point Process Modeling of Crime , 2011 .

[49]  T. Evgeniou,et al.  Disjunctions of Conjunctions, Cognitive Simplicity, and Consideration Sets , 2010 .

[50]  Dimitrios Gunopulos,et al.  Subspace Clustering of High Dimensional Data , 2004, SDM.

[51]  David Weisburd,et al.  Crime and Disorder in Drug Hot Spots: Implications for Theory and Practice in Policing , 2000 .

[52]  Erik Duval,et al.  Context-Aware Recommender Systems for Learning: A Survey and Future Challenges , 2012, IEEE Transactions on Learning Technologies.

[53]  Gregory F. Cooper,et al.  A multivariate Bayesian scan statistic for early event detection and characterization , 2010, Machine Learning.

[54]  Martin Ester,et al.  Mining Cohesive Patterns from Graphs with Feature Vectors , 2009, SDM.

[55]  Hans-Peter Kriegel,et al.  Subspace clustering , 2012, WIREs Data Mining Knowl. Discov..

[56]  H. Chipman,et al.  Bayesian CART Model Search , 1998 .

[57]  Jian Pei,et al.  CMAR: accurate and efficient classification based on multiple class-association rules , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[58]  D. Weisburd Bringing Social Context Back into the Equation: The Importance of Social Characteristics of Places in the Prevention of Crime , 2015 .

[59]  George E. Tita,et al.  Measuring and Modeling Repeat and Near-Repeat Burglary Effects , 2009 .

[60]  Vipin Kumar,et al.  Similarity Measures for Categorical Data: A Comparative Evaluation , 2008, SDM.

[61]  Simon Price,et al.  Inductive Logic Programming , 2000, Lecture Notes in Computer Science.

[62]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.

[63]  Philip S. Yu,et al.  MaPle: a fast algorithm for maximal pattern-based clustering , 2003, Third IEEE International Conference on Data Mining.

[64]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[65]  Adam R. Klivans,et al.  Learning DNF in time 2 Õ(n 1/3 ) . , 2001, STOC 2001.

[66]  Stephen Chi-fai Chan,et al.  Incremental Mining for Temporal Association Rules for Crime Pattern Discoveries , 2007, ADC.

[67]  Andrea L. Bertozzi,et al.  c ○ World Scientific Publishing Company A STATISTICAL MODEL OF CRIMINAL BEHAVIOR , 2008 .

[68]  Christian Borgelt,et al.  An implementation of the FP-growth algorithm , 2005 .

[69]  Tong Wang,et al.  Learning Optimized Or's of And's , 2015, ArXiv.

[70]  S. Chainey,et al.  Mapping Crime: Understanding Hot Spots , 2014 .

[71]  Xing Zhang,et al.  A new approach to classification based on association rule mining , 2006, Decis. Support Syst..

[72]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[73]  Matthias Baldauf,et al.  A survey on context-aware systems , 2007, Int. J. Ad Hoc Ubiquitous Comput..

[74]  D. Nelson,et al.  Identification of an optimal subgroup for treatment evaluation of patients with brain metastases using RTOG study 7916. , 1989, International journal of radiation oncology, biology, physics.

[75]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, CACM.

[76]  Bart Baesens,et al.  Building Acceptable Classification Models , 2010, Data Mining.

[77]  D. Rubin Estimating causal effects of treatments in randomized and nonrandomized studies. , 1974 .

[78]  C. Rudin,et al.  Clinical Prediction Models for Sleep Apnea: The Importance of Medical History over Symptoms. , 2016, Journal of clinical sleep medicine : JCSM : official publication of the American Academy of Sleep Medicine.

[79]  Cynthia Rudin,et al.  Box drawings for learning with imbalanced data , 2014, KDD.

[80]  R. Dawes Judgment under uncertainty: The robust beauty of improper linear models in decision making , 1979 .

[81]  Xindong Wu,et al.  Mining Both Positive and Negative Association Rules , 2002, ICML.

[82]  Ariel D. Procaccia,et al.  Exact VC-Dimension of Monotone Formulas , 2006 .

[83]  Niklas Lavesson,et al.  User-oriented Assessment of Classification Model Understandability , 2011, SCAI.

[84]  Geoffrey P. Goodwin,et al.  Logic, probability, and human reasoning , 2015, Trends in Cognitive Sciences.

[85]  Tong Wang,et al.  Detecting Patterns of Crime with Series Finder , 2013, AAAI.

[86]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[87]  Donald E. Brown,et al.  Data association methods with applications to law enforcement , 2003, Decis. Support Syst..

[88]  Cynthia Rudin,et al.  Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model , 2015, ArXiv.

[89]  Srinivasan Parthasarathy,et al.  New Algorithms for Fast Discovery of Association Rules , 1997, KDD.

[90]  Thomas Seidl,et al.  DB-CSC: A Density-Based Approach for Subspace Clustering in Graphs with Feature Vectors , 2011, ECML/PKDD.

[91]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.