Data cleaning techniques for software engineering data sets

Data quality is an important issue that has been recognised and addressed in research communities such as data warehousing, data mining and information systems. It is widely agreed that poor data quality degrades the results of analyses and, in turn, the decisions made on the basis of those results. Empirical software engineering has, to some extent, neglected the issue of data quality. This raises the question of how researchers in empirical software engineering can trust their results without addressing the quality of the data they analyse. One widely accepted definition describes data quality as ‘fitness for purpose’. Poor data quality can be addressed either by introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter, with a special focus on noise handling.

Three noise handling techniques, all of which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected.

The techniques were first evaluated in two investigations, each applying them to a large real-world software engineering data set. The first investigation tested the techniques’ ability to improve predictive accuracy at differing noise levels. All three techniques improved predictive accuracy in comparison with the do-nothing approach, and filtering and polish was the most successful of the three. The second investigation tested the techniques’ ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before the three techniques were applied. Robust filtering and predictive filtering decreased the number of instances with implausible values, but also substantially decreased the size of the data set. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set. Since the data set contained historical software project data, it was not possible to know
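
The three noise handling approaches named above can be illustrated with a minimal sketch. It is not the thesis's implementation: the depth-1 decision tree (`fit_stump`), the interleaved k-fold split, the restriction to class-label noise, and all function names and toy data are assumptions made purely for illustration. Robust filtering trains and tests on the same instances, predictive filtering tests each instance with a tree that never saw it, and filtering and polish replaces a disputed label instead of deleting the instance.

```python
from collections import Counter

def majority(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Fit a depth-1 decision tree: the (feature, threshold) split with the fewest training errors."""
    default = majority(y)
    # Start from the no-split model that always predicts the majority class.
    best_errors, best = sum(lab != default for lab in y), (None, None, default, default)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive observed values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if errors < best_errors:
                best_errors, best = errors, (f, t, l_lab, r_lab)
    return best

def predict(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:  # no useful split was found
        return l_lab
    return l_lab if row[f] <= t else r_lab

def robust_filter(X, y):
    """Training and test sets are the same: drop instances the tree misclassifies."""
    stump = fit_stump(X, y)
    keep = [i for i, row in enumerate(X) if predict(stump, row) == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]

def cross_val_predictions(X, y, k=3):
    """Predict each instance with a tree trained on the other folds (assumed interleaved split)."""
    preds = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        stump = fit_stump([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            preds[i] = predict(stump, X[i])
    return preds

def predictive_filter(X, y, k=3):
    """Training and test sets differ: drop instances misclassified when held out."""
    preds = cross_val_predictions(X, y, k)
    keep = [i for i in range(len(X)) if preds[i] == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]

def filter_and_polish(X, y, k=3):
    """Keep every instance, but replace a label wherever the held-out prediction disagrees."""
    preds = cross_val_predictions(X, y, k)
    return list(X), [p if p != lab else lab for p, lab in zip(preds, y)]

# Hypothetical toy data: one numeric attribute, with the label of x=2 deliberately corrupted.
X = [[1], [2], [3], [4], [5], [11], [12], [13], [14], [15]]
y = ["low", "high", "low", "low", "low", "high", "high", "high", "high", "high"]

print(len(robust_filter(X, y)[0]), len(predictive_filter(X, y)[0]), filter_and_polish(X, y)[1])
```

The two filters shrink the data set, whereas polishing keeps its size but writes predicted values into it. This mirrors the trade-off the abstract reports between removing instances and preserving data set size.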
