SETAP: Software engineering teamwork assessment and prediction using machine learning

Effective teaching of teamwork skills in local and globally distributed Software Engineering (SE) teams is recognized as an important part of the education of current and future software engineers. Effective methods for assessment and early prediction of learning effectiveness in SE teamwork are not only a critical part of teaching but are also of value in industrial training and project management. This paper presents a novel analytical approach to the assessment and, most importantly, the prediction of learning outcomes in SE teamwork, based on data from our joint software engineering class taught concurrently at San Francisco State University (SFSU), Florida Atlantic University (FAU), and Fulda University, Germany (Fulda). Our approach focuses on assessment and prediction of SE teamwork in terms of the ability of student teams to apply best SE processes and develop SE products. It differs from existing work in two respects: a) it develops and uses only objective and quantitative measures of team activity from multiple sources, such as statistics of student time use, software engineering tool use, and instructor observations; b) it applies machine learning (ML) techniques to these team activity measurements to identify quantitative and objective factors that can assess and predict the learning of software engineering teamwork skills at the team level.
In this paper we make the following contributions: a) we present in detail, for the first time, the full team activity measurement data set we developed, consisting of over 40 objective and quantitative measures extracted from student teams working on class projects; b) we present an ML framework that applies the Random Forest (RF) algorithm to the team activity measurements and team outcomes, focusing on predicting teams that are likely to fail; c) we describe in detail our now fully implemented and operational data processing pipeline, consisting of data collection methods from multiple sources, ML training database creation, and ML analysis subsystems; and finally d) we present very preliminary ML analysis results based on data from 17 student teams in our joint software engineering classes in Fall 2012 and Spring 2013. While our ML training database is currently small, it is growing continuously. Our preliminary results, verified with two independent accuracy measures, show that RF is able to predict SE Process and SE Product team performance in an intuitively explainable manner.
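To make the RF idea in contribution b) concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes synthetic data (17 hypothetical teams, 4 made-up numeric measures, and an invented rule for the "fail" label) and substitutes depth-one trees (decision stumps) for full decision trees to stay short. All function names are hypothetical. The out-of-bag (OOB) estimate shown is one common way to gauge RF accuracy on small data sets; it is not claimed to be one of the two accuracy measures referred to above.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        maj = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), maj, maj)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=200, seed=0):
    """Bagging plus a random feature subset per tree, as in Random Forest."""
    rng = random.Random(seed)
    n, n_feat = len(rows), len(rows[0])
    m = max(1, int(n_feat ** 0.5))  # number of features tried per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(range(n_feat), m)
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(idx)))  # remember which teams were in-bag
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_stump(s, row) for s, _ in forest)
    return votes.most_common(1)[0][0]

def oob_accuracy(forest, rows, labels):
    """Score each team using only the trees that never saw it in training."""
    correct = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row) for s, in_bag in forest if i not in in_bag)
        if votes:
            counted += 1
            correct += votes.most_common(1)[0][0] == y
    return correct / counted if counted else float("nan")

# Synthetic stand-in data: 17 teams, 4 made-up measures (NOT the real data set).
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(4)] for _ in range(17)]
labels = ["fail" if r[0] < 0.35 else "pass" for r in rows]  # invented rule

forest = fit_forest(rows, labels)
print("OOB accuracy:", oob_accuracy(forest, rows, labels))
print("Prediction for team 0:", predict_forest(forest, rows[0]))
```

In practice an established RF implementation would be preferable to this hand-written stand-in, and with only 17 teams any accuracy estimate should be read with caution.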
