Experiences with using Data Cleaning Technology for Bing Services

Over the past few years, our Data Management, Exploration and Mining (DMX) group at Microsoft Research has worked closely with the Bing team to address challenging data cleaning and approximate matching problems. In this article we describe some of the key Big Data challenges in the context of these Bing services primarily focusing on two key services: Bing Maps and Bing Shopping. We describe ideas that proved crucial in helping meet the quality, performance and scalability goals demanded by these services. We also briefly reflect on the lessons learned and comment on opportunities for future work in data cleaning technology for Big Data.

[1]  Surajit Chaudhuri,et al.  Data cleaning in microsoft SQL server 2005 , 2005, SIGMOD '05.

[2]  Surajit Chaudhuri,et al.  Learning String Transformations From Examples , 2009, Proc. VLDB Endow..

[3]  Surajit Chaudhuri,et al.  Towards a Domain Independent Platform for Data Cleaning , 2011, IEEE Data Eng. Bull..

[4]  Surajit Chaudhuri,et al.  A framework for robust discovery of entity synonyms , 2012, KDD.

[5]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[6]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[7]  Yeye He,et al.  SEISA: set expansion by iterative similarity aggregation , 2011, WWW.

[8]  Surajit Chaudhuri,et al.  Transformation-based Framework for Record Matching , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[9]  Aditya G. Parameswaran,et al.  Fuzzy Joins Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.