Adaptive ridge regression system for software cost estimating on multi-collinear datasets

Cost estimation is one of the most critical activities in the software life cycle. Over the past decades, a number of techniques have been proposed for cost estimation, and linear regression remains the most frequently applied method in the literature. However, a number of studies point out that linear regression is prone to low prediction accuracy, for reasons such as non-linearity and non-normality. One less frequently addressed reason is multi-collinearity, which may lead to unstable regression coefficients. At the same time, multi-collinearity has been reported to be widespread in software engineering datasets. To tackle this problem and improve the accuracy of regression, we propose a holistic problem-solving approach, named the adaptive ridge regression system, which integrates data transformation, multi-collinearity diagnosis, the ridge regression technique and multi-objective optimization. The proposed system is tested on two real-world datasets and compared with OLS regression, stepwise regression and other machine learning methods. The results indicate that the adaptive ridge regression system can significantly improve the performance of regression on multi-collinear datasets and produce more explainable results than machine learning methods.
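
To make the multi-collinearity problem concrete, the following is a minimal sketch of a condition-number diagnosis followed by a ridge fit. It is not the adaptive ridge regression system itself: the synthetic data, the helper names `condition_number` and `ridge_fit`, and the fixed ridge parameter `k = 1.0` are all assumptions made for illustration, whereas the abstract describes the system as selecting its settings through multi-objective optimization.

```python
import numpy as np

def condition_number(X):
    """Condition number of the column-scaled design matrix.

    Large values (a common rule of thumb is above 30) suggest
    multi-collinearity among the predictors.
    """
    Xs = X / np.linalg.norm(X, axis=0)        # scale columns to unit length
    s = np.linalg.svd(Xs, compute_uv=False)   # singular values, descending
    return s[0] / s[-1]

def ridge_fit(X, y, k):
    """Ridge coefficients on centred data: (X'X + kI)^-1 X'y.

    Setting k = 0 reduces this to ordinary least squares.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + k * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic example (assumed data): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

cond = condition_number(X)
_, beta_ols = ridge_fit(X, y, 0.0)     # OLS: coefficients can be unstable
_, beta_ridge = ridge_fit(X, y, 1.0)   # ridge: coefficients shrunk toward zero

print("condition number:", cond)
print("OLS coefficients:  ", beta_ols)
print("ridge coefficients:", beta_ridge)
```

For any `k > 0`, the ridge coefficient vector is never longer than the OLS one, which is the stabilising effect that motivates using ridge regression on collinear predictors. The price is some bias in the estimates, so the choice of `k` matters in practice.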
