Repairing Data through Regular Expressions

Since regular expressions are often used to detect errors in sequences such as strings or dates, it is natural to use them for data repair as well. Motivated by this, we propose a regular-expression-based data repair method that makes input sequence data conform to a given regular expression at minimal revision cost. The method has two steps: sequence repair and token value repair. For sequence repair, we propose the Regular-expression-based Structural Repair (RSR) algorithm, a dynamic programming algorithm that uses a Nondeterministic Finite Automaton (NFA) to compute the edit distance between a prefix of the input string and a partial pattern of the regular expression. It runs in O(nm^2) time and O(mn) space, where m is the number of edges in the NFA and n is the length of the input string. We also develop an optimization strategy that improves performance on long strings. For token value repair, we combine an edit-distance-based method with association rules through a unified argument for selecting the proper method. Experimental results on both real and synthetic data show that the proposed method repairs data effectively and efficiently.
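To make the idea of an edit distance between a string and a regular expression concrete, the following is a minimal sketch and not the RSR algorithm itself. It assumes the regular expression has already been compiled by hand into an NFA, given as a hypothetical `transitions` dictionary with integer states, and it uses a shortest-path search over (input position, NFA state) pairs instead of the paper's dynamic programming table. Unit costs for insertion, deletion, and substitution are also an assumption.

```python
import heapq

def edit_distance_to_nfa(s, transitions, start, accepting):
    """Smallest number of single-character edits that turn s into some
    string accepted by the NFA, or None if the NFA accepts nothing.

    transitions: dict mapping state -> list of (symbol, next_state);
                 symbol None marks an epsilon move (hypothetical format).
    """
    n = len(s)
    # Node (i, q): the first i characters of s are consumed, NFA is in state q.
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), float("inf")):
            continue  # stale heap entry
        if i == n and q in accepting:
            return d  # first accepting node popped has minimal cost
        moves = []
        if i < n:
            moves.append((d + 1, i + 1, q))  # delete s[i]
        for sym, nxt in transitions.get(q, []):
            if sym is None:
                moves.append((d, i, nxt))  # epsilon move, free
            else:
                moves.append((d + 1, i, nxt))  # insert sym
                if i < n:
                    # match costs 0, substitution costs 1
                    moves.append((d + (s[i] != sym), i + 1, nxt))
        for nd, ni, nq in moves:
            if nd < dist.get((ni, nq), float("inf")):
                dist[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    return None

# Hand-built NFA for the pattern ab*c (illustrative only).
AB_STAR_C = {0: [("a", 1)], 1: [("b", 1), ("c", 2)]}

print(edit_distance_to_nfa("abxc", AB_STAR_C, 0, {2}))
```

Here the string "abxc" is one edit away from the language of ab*c, since either deleting "x" or replacing it with "b" yields a match. Recording which move produced each distance would give the repaired string as well as its cost.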
