Mining big data to extract patterns and predict real-life outcomes.

This article aims to introduce the reader to essential tools that can be used to obtain insights and build predictive models using large data sets. Recent user proliferation in the digital environment has led to the emergence of large samples containing a wealth of traces of human behaviors, communication, and social interactions. Such samples offer the opportunity to greatly improve our understanding of individuals, groups, and societies, but their analysis presents unique methodological challenges. In this tutorial, we discuss potential sources of such data and explain how to efficiently store them. Then, we introduce two methods that are often employed to extract patterns and reduce the dimensionality of large data sets: singular value decomposition and latent Dirichlet allocation. Finally, we demonstrate how to use dimensions or clusters extracted from data to build predictive models in a cross-validated way. The text is accompanied by examples of R code and a sample data set, allowing the reader to practice the methods discussed here. A companion website (http://dataminingtutorial.com) provides additional learning resources.
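
The workflow summarized above (reduce a large user-by-item matrix with singular value decomposition, then use the resulting dimensions in a cross-validated predictive model) can be illustrated with a minimal sketch. This is not the article's own code, which is written in R. It is a Python/NumPy approximation under several assumptions: the data are synthetic, the function name `cross_validated_svd_predictions` and all parameter values are hypothetical, the columns are centered before decomposition, and ordinary least squares stands in for whatever model the reader prefers. The decomposition is refit on the training rows of every fold so that held-out users never influence the extracted dimensions.

```python
import numpy as np


def cross_validated_svd_predictions(X, y, k=5, n_folds=10, seed=0):
    """Return out-of-fold predictions of y from the top-k SVD dimensions of X.

    Hypothetical helper for illustration only. X is a dense (users x items)
    array and y is a numeric outcome with one value per user.
    """
    rng = np.random.default_rng(seed)
    n_users = X.shape[0]
    # Assign every user to one of n_folds roughly equal folds at random.
    folds = rng.permutation(n_users) % n_folds
    preds = np.empty(n_users)

    for fold in range(n_folds):
        test = folds == fold
        train = ~test

        # Center using training-set column means only (an assumption of this sketch).
        col_means = X[train].mean(axis=0)
        X_train = X[train] - col_means
        X_test = X[test] - col_means

        # SVD of the training rows; the rows of Vt are the item-space dimensions.
        _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
        V_k = Vt[:k].T

        # Project training and held-out users onto the same k dimensions.
        Z_train = X_train @ V_k
        Z_test = X_test @ V_k

        # Ordinary least squares with an intercept, fit on training users only.
        A_train = np.column_stack([np.ones(Z_train.shape[0]), Z_train])
        A_test = np.column_stack([np.ones(Z_test.shape[0]), Z_test])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        preds[test] = A_test @ beta

    return preds


# Synthetic stand-in for a user-by-item matrix (for example, users by liked pages).
rng = np.random.default_rng(42)
n_users, n_items = 300, 40
latent = rng.normal(size=(n_users, 2))       # two hidden user traits
loadings = rng.normal(size=(2, n_items))     # how each item relates to the traits
noise = rng.normal(size=(n_users, n_items))
X = (latent @ loadings + noise > 0).astype(float)   # binary footprint matrix
y = latent[:, 0] + 0.5 * rng.normal(size=n_users)   # outcome driven by the first trait

preds = cross_validated_svd_predictions(X, y, k=5, n_folds=10)
r = np.corrcoef(preds, y)[0, 1]   # out-of-fold prediction accuracy
print(f"Cross-validated correlation between predicted and actual outcome: {r:.2f}")
```

For genuinely large and sparse data, a dense `np.linalg.svd` call would not be practical; a sparse matrix format and a truncated SVD routine would be the natural substitutes, and the number of retained dimensions `k` would normally be tuned rather than fixed.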
