The Ensemble and Model Comparison Approaches for Big Data Analytics in Social Sciences

Big data analytics is now prevalent in fields such as business, engineering, public health, and the physical sciences, but social scientists have been slower than their peers in other disciplines to adopt this methodology. One major reason is that traditional statistical procedures are typically not suitable for analyzing large and complex data sets. Although data mining techniques can alleviate this problem, it is often unclear to social science researchers which technique is best suited to a particular research problem. The main objective of this paper is to illustrate how a model comparison of two popular ensemble methods, boosting and bagging, can yield an improved explanatory model.
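To make the two ensemble methods concrete, the following is a minimal, self-contained sketch (not the paper's actual analysis) of bagging and AdaBoost-style boosting built from decision stumps on a noisy toy data set. The data set, the number of rounds, and the stump learner are all illustrative assumptions; real analyses would use a full tree learner and held-out validation rather than training accuracy.

```python
import math
import random

# Toy 1-D classification task: the true rule is y = +1 when x > 0.5.
# 10% of labels are flipped so single stumps make errors, which is
# what gives an ensemble room to help.
random.seed(1)
X = [random.random() for _ in range(200)]
y = [1 if x > 0.5 else -1 for x in X]
for i in random.sample(range(len(X)), 20):
    y[i] = -y[i]

def fit_stump(xs, ys, w):
    """Return the (threshold, sign, error) stump minimizing weighted error."""
    best_err, best_t, best_s = float("inf"), 0.0, 1
    for t in xs:
        for s in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (s if xi > t else -s) != yi)
            if err < best_err:
                best_err, best_t, best_s = err, t, s
    return best_t, best_s, best_err

def predict_stump(t, s, x):
    return s if x > t else -s

def bagging(xs, ys, n_rounds=25):
    """Fit stumps on bootstrap resamples; combine by majority vote."""
    n = len(xs)
    uniform = [1.0 / n] * n
    stumps = []
    for _ in range(n_rounds):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        t, s, _ = fit_stump(bx, by, uniform)
        stumps.append((t, s))
    def predict(x):
        vote = sum(predict_stump(t, s, x) for t, s in stumps)
        return 1 if vote >= 0 else -1
    return predict

def adaboost(xs, ys, n_rounds=25):
    """Reweight the sample each round so later stumps focus on mistakes."""
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        t, s, err = fit_stump(xs, ys, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, s))
        w = [wi * math.exp(-alpha * yi * predict_stump(t, s, xi))
             for wi, xi, yi in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    def predict(x):
        score = sum(a * predict_stump(t, s, x) for a, t, s in model)
        return 1 if score >= 0 else -1
    return predict

def accuracy(predict, xs, ys):
    return sum(predict(x) == yi for x, yi in zip(xs, ys)) / len(xs)

bag = bagging(X, y)
boo = adaboost(X, y)
print("bagging accuracy :", accuracy(bag, X, y))
print("boosting accuracy:", accuracy(boo, X, y))
```

The sketch highlights the structural contrast the paper's model comparison exploits: bagging reduces variance by averaging independently resampled fits, while boosting reduces bias by sequentially reweighting hard cases — so their relative performance depends on the noise and signal structure of a given data set.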
