A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data

We introduce a tree-based approach for assessing the performance impact of diverse self-selected interventions in management research. Our approach, which takes advantage of "Big Data", or observational data with large sample sizes and a large number of variables, offers important advantages over traditional propensity score matching. In particular, the tree-based approach to assessing the impact of interventions offers a data-driven methodology that applies to a wide range of intervention types (binary, polytomous, continuous), allows for examination of nascent interventions whose selection cannot be theoretically specified a priori, identifies pre-intervention variables that correlate with the self-selected intervention, and presents comparisons of ensuing performance in visuals that are easy to discern and understand. We illustrate the method and the insights that it yields in the context of two studies: analysis of the impact of an eGov service in India, and comparison of performance across different contractual pricing mechanisms and contract durations in the outsourcing of technology and technology-enabled business functions.
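The core idea can be illustrated with a minimal sketch. This is not the authors' implementation; it is one plausible reading of the abstract. It assumes a binary intervention, grows a small classification tree with the self-selected intervention as the target and pre-intervention variables as predictors, and then compares ensuing performance between the two groups within each leaf. All names (`treated`, `outcome`, `size`, `grow_leaves`, `leaf_comparisons`) and the toy data are hypothetical and were invented for illustration.

```python
from statistics import mean

def gini(rows):
    """Gini impurity of the binary intervention indicator in `rows`."""
    if not rows:
        return 0.0
    p = sum(r["treated"] for r in rows) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows, covariates, min_leaf):
    """Return the (covariate, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(rows)
    for c in covariates:
        for t in sorted({r[c] for r in rows})[:-1]:
            left = [r for r in rows if r[c] <= t]
            right = [r for r in rows if r[c] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (c, t), score
    return best

def grow_leaves(rows, covariates, max_depth=2, min_leaf=2, path=()):
    """Recursively partition on pre-intervention covariates; return (path, rows) per leaf."""
    split = best_split(rows, covariates, min_leaf) if max_depth > 0 else None
    if split is None:
        return [(path, rows)]
    c, t = split
    left = [r for r in rows if r[c] <= t]
    right = [r for r in rows if r[c] > t]
    return (grow_leaves(left, covariates, max_depth - 1, min_leaf, path + (f"{c} <= {t}",))
            + grow_leaves(right, covariates, max_depth - 1, min_leaf, path + (f"{c} > {t}",)))

def leaf_comparisons(leaves):
    """Within each leaf, compare mean outcome of the intervention and comparison groups."""
    out = []
    for path, rows in leaves:
        treated = [r["outcome"] for r in rows if r["treated"] == 1]
        control = [r["outcome"] for r in rows if r["treated"] == 0]
        diff = mean(treated) - mean(control) if treated and control else None
        out.append({"leaf": " & ".join(path) or "root",
                    "n_treated": len(treated), "n_control": len(control), "diff": diff})
    return out

# Toy data (invented): larger units self-select into the intervention more often
# and also perform better regardless of the intervention.
rows = [
    {"size": 1, "treated": 0, "outcome": 10.0},
    {"size": 1, "treated": 0, "outcome": 12.0},
    {"size": 1, "treated": 0, "outcome": 11.0},
    {"size": 1, "treated": 1, "outcome": 13.0},
    {"size": 5, "treated": 1, "outcome": 20.0},
    {"size": 5, "treated": 1, "outcome": 22.0},
    {"size": 5, "treated": 1, "outcome": 21.0},
    {"size": 5, "treated": 0, "outcome": 19.0},
]

# Naive comparison that ignores self-selection.
naive = (mean(r["outcome"] for r in rows if r["treated"] == 1)
         - mean(r["outcome"] for r in rows if r["treated"] == 0))

leaves = grow_leaves(rows, covariates=["size"])
results = leaf_comparisons(leaves)

print("naive difference:", naive)
for r in results:
    print(r)
```

In this toy data the naive difference in mean outcome is 6, while the difference within each leaf is 2, because the tree separates the units by the variable that drives selection. A real analysis would use an established tree implementation, many more pre-intervention variables, and checks on whether each leaf contains enough units from both groups.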
