Towards methods for systematic research on big data

Big Data is characterized by the five V's: Volume, Velocity, Variety, Veracity, and Value. Research on Big Data, that is, the practice of gaining insights from it, challenges the intellectual, process, and computational limits of an enterprise. Choosing an appropriate toolset requires careful consideration of a large software ecosystem. Powerful algorithms exist, but the exploratory and often ad hoc nature of analytic demands, together with a distinct lack of established processes and methodologies, makes it difficult for Big Data teams to set expectations or even create valid project plans. The exponential growth of generated data exceeds the capacity of humans to process it and compels us to develop automated computing methods, which require significant and expensive computing power in order to scale effectively. In this paper, we characterize data-driven practice and research and explore how we might design effective methods for systematizing such practice and research [19, 22]. Brief case studies are presented to ground our conclusions and insights.

[1]  Jeffrey S. Saltz, et al. The need for new processes, methodologies and tools to support big data teams and improve big data project effectiveness, 2015, 2015 IEEE International Conference on Big Data (Big Data).

[2]  A. M. Novikov, et al. Research Methodology: From Philosophy of Science to Research Design, 2013.

[3]  Ken W. Collier, et al. Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing, 2011.

[4]  Vipin Kumar, et al. Introduction to Data Mining, 2022, Data Mining and Machine Learning Applications.

[5]  Micha Elsner, et al. TopChurn: Maximum Entropy Churn Prediction Using Topic Models Over Heterogeneous Signals, 2015, WWW.

[6]  Michael J. Franklin. The Berkeley Data Analytics Stack: Present and future, 2013, Big Data 2013.

[7]  Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[8]  Marco D. Santambrogio, et al. Runtime adaptation on dataflow HPC platforms, 2013, 2013 NASA/ESA Conference on Adaptive Hardware and Systems (AHS-2013).

[9]  Jignesh M. Patel, et al. Big data and its technical challenges, 2014, CACM.

[10]  M. Mehl, et al. Handbook of research methods for studying daily life, 2012.

[11]  Sanjay Ghemawat, et al. MapReduce: simplified data processing on large clusters, 2008, CACM.

[12]  Bernard Marr, et al. Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance, 2015.

[13]  Martin Gilje Jaatun, et al. Agile Software Development, 2002, Comput. Sci. Educ.

[14]  Vasant Dhar, et al. Data science and prediction, 2012, CACM.

[15]  Sanjay Ghemawat, et al. The Google file system, 2003.

[16]  Antony Rowstron, et al. Nobody ever got fired for using Hadoop on a cluster, 2012, HotCDP '12.

[17]  Mateo Valero, et al. Moving from petaflops to petadata, 2013, CACM.

[18]  Daniel Muijs, et al. Doing quantitative research in education with SPSS, 2nd edition, 2010.

[19]  Daniel Muijs. Introduction to Quantitative Research, 2004.

[20]  George A. Miller, et al. WordNet: A Lexical Database for English, 1995, HLT.

[21]  Alistair Cockburn, et al. Agile Software Development, 2001.

[22]  Ramesh Nallapati, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, 2009, EMNLP.

[23]  M. Patton, et al. Qualitative evaluation and research methods, 1992.

[24]  Michael Stonebraker, et al. A comparison of approaches to large-scale data analysis, 2009, SIGMOD Conference.

[25]  Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[26]  Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.

[27]  Charles Anderson, et al. The end of theory: The data deluge makes the scientific method obsolete, 2008.

[28]  Andrey Gubarev, et al. Dremel: Interactive Analysis of Web-Scale Datasets, 2011.