Using Machine Learning to Recommend Correctness Checks for Geographic Map Data

Developing an industry application that serves geographic map data to users across the world presents a significant challenge: verifying the data with "data correctness checks." The size of the data that needs to be checked (the entire world) and the data churn rate (thousands of changes per day) make executing the full set of candidate checks cost-prohibitive. Current techniques rely on hand-curated, static subsets of checks that are run at different stages of the data production pipeline. These hard-coded subsets are uninformed by data changes and cause bug detection to be delayed to downstream quality assurance activities. To address these problems, we have developed new representations of map data changes and checks, formally defined "check safety," and built a recommender system that dynamically and automatically selects and ranks a relevant subset of checks using signals from the latest data changes. Empirical evaluation shows that it improves (1) efficiency, by eliminating 65% of checks unrelated to changes; (2) coverage, by recommending and ranking change-related checks from the full set of candidate checks, including those previously excluded by the manual process; and (3) overall visibility into the data editing process, by quickly and automatically identifying the latest fault-prone parts of the data.
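
The idea of selecting and ranking checks from change signals can be illustrated with a minimal sketch. This is not the system described above, which uses learned representations; it is a hand-written overlap score under stated assumptions. The function `recommend_checks`, the feature-type labels, and the check names are all hypothetical. Each change is reduced to the feature type it touched, each check declares which feature types it inspects, checks with no overlap are dropped, and the rest are ranked by how many recent changes fall in their scope.

```python
from collections import Counter

def recommend_checks(changed_features, check_scopes):
    """Select and rank checks by overlap with recent changes.

    changed_features: list of feature-type labels, one per recent change
                      (hypothetical representation of a map data change).
    check_scopes:     dict mapping check name -> set of feature types the
                      check inspects (hypothetical representation of a check).
    Returns a list of (check_name, score), highest score first.
    """
    change_counts = Counter(changed_features)
    scored = []
    for check, scope in check_scopes.items():
        # Score = number of recent changes touching any feature type in scope.
        score = sum(change_counts[feature] for feature in scope)
        if score > 0:  # drop checks unrelated to any recent change
            scored.append((check, score))
    # Rank by descending score; break ties by name for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored

# Illustrative data only; not taken from the paper.
changes = ["road", "road", "building", "road", "poi"]
checks = {
    "road_connectivity": {"road"},
    "building_overlap": {"building"},
    "address_format": {"address"},
    "poi_on_road": {"poi", "road"},
    "water_polygon_closed": {"water"},
}

ranked = recommend_checks(changes, checks)
for name, score in ranked:
    print(name, score)
```

In this toy input, two of the five checks touch no changed feature type and are eliminated, which mirrors in miniature the efficiency gain the abstract reports. A learned model would replace the overlap score with a predicted likelihood that a check fails given the change.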
