Engineering big data solutions

Structured and unstructured data in operational support tools have long been prevalent in software engineering. Similar data is now becoming widely available in other domains. Software systems that use such operational data (OD) to help with software design and maintenance activities are increasingly being built, despite the difficulty of drawing valid conclusions from disparate and low-quality data and despite the continuing evolution of operational support tools. This paper proposes systematizing approaches to the engineering of OD-based systems. To prioritize and structure research areas, we consider historic developments, such as big data hype; synthesize defining features of OD, such as confounded measures and unobserved context; and discuss emerging applications, such as diverse and large OD collections and extremely short development intervals. To sustain the credibility of OD-based systems, more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.
