Using unreliable data for creating more reliable online learners

Some machine learning applications involve deciding whether or not to use unreliable data for learning. Previous work shows that learners trained on unreliable data in addition to reliable data perform similarly to, or worse than, learners trained solely on reliable data. Such work typically treats unreliable data as if it were reliable and considers only the offline learning scenario. This paper shows that unreliable data can be used to improve performance in online learning scenarios where a pre-existing set of unreliable data is available. We propose Dynamic Un+Reliable data learners (DUR), an approach that determines when unreliable data can be useful by maintaining a fixed-size weighted memory of learners trained on unreliable data. The weights represent how well these learners perform on the current concept and are updated throughout DUR's lifetime. DUR outperforms not only an approach that uses reliable data alone, but also an approach that treats unreliable data as if it were reliable. Moreover, its variance in performance is lower than that of the reliable-data-only approach; in other words, DUR is itself a more reliable learner.
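To make the memory-and-weights mechanism concrete, below is a minimal sketch in Python. It assumes a multiplicative weight penalty in the spirit of Dynamic Weighted Majority and an unweighted vote for the reliable-data learner; the class names, the penalty factor beta, and the combination scheme are illustrative assumptions, not the paper's exact algorithm.

    # Minimal sketch of a DUR-style ensemble. The multiplicative weight
    # update, the vote combination, and all names here (DURSketch, beta,
    # ConstantLearner) are illustrative assumptions, not the paper's
    # exact algorithm.
    import numpy as np

    class DURSketch:
        """Fixed-size weighted memory of learners pre-trained on
        unreliable data, combined with a learner trained online on
        reliable data."""

        def __init__(self, unreliable_learners, reliable_learner, beta=0.5):
            self.memory = list(unreliable_learners)   # fixed size, set up front
            self.weights = np.ones(len(self.memory))  # one weight per learner
            self.reliable = reliable_learner          # updated online
            self.beta = beta                          # assumed penalty factor

        def predict(self, x):
            # Weighted vote: each memory learner votes with its current
            # weight; the reliable-data learner votes with weight 1.
            votes = {}
            for w, learner in zip(self.weights, self.memory):
                label = learner.predict(x)
                votes[label] = votes.get(label, 0.0) + w
            label = self.reliable.predict(x)
            votes[label] = votes.get(label, 0.0) + 1.0
            return max(votes, key=votes.get)

        def update(self, x, y_true):
            # Penalise memory learners that are wrong on the current
            # concept, renormalise, and train the reliable-data learner.
            for i, learner in enumerate(self.memory):
                if learner.predict(x) != y_true:
                    self.weights[i] *= self.beta
            self.weights /= self.weights.sum()
            self.reliable.partial_fit(x, y_true)

    # Illustrative usage with trivial stand-in learners.
    class ConstantLearner:
        def __init__(self, label):
            self.label = label
        def predict(self, x):
            return self.label
        def partial_fit(self, x, y):
            pass  # a real learner would update itself here

    dur = DURSketch([ConstantLearner(0), ConstantLearner(1)],
                    ConstantLearner(1))
    for x, y in [(0.1, 1), (0.4, 1), (0.9, 0)]:
        print(dur.predict(x), y)
        dur.update(x, y)

The multiplicative penalty lets the weight of a memory learner decay quickly when it mismatches the current concept, while renormalisation keeps previously penalised learners recoverable if a similar concept recurs.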
