Automating pattern discovery for rule based data standardization systems

Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule-based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time-consuming, particularly for noisy or highly specialized datasets. It is also subjective, since the result depends on the people identifying the patterns. In this paper we present a tool that automatically mines patterns to help improve the efficiency and effectiveness of these data standardization systems. Domain and knowledge experts then use the automatically extracted patterns for rule writing. We use a greedy algorithm to extract patterns that maximize coverage of the data. We further group the extracted patterns so that each group represents patterns that capture similar domain knowledge, and we propose a similarity measure that uses input pattern semantics to perform this grouping. We demonstrate the effectiveness of our method for standardization tasks on three real-world datasets.
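
The abstract does not spell out the greedy extraction step, so the following is only a minimal sketch of one way greedy coverage-driven pattern selection could look, not the tool's actual algorithm. The token classes (`NUM`, `ALPHA`, `MIXED`), the use of short contiguous class sequences as candidate patterns, and the names `token_classes`, `candidate_patterns`, and `greedy_cover` are all assumptions made for illustration.

```python
from collections import defaultdict


def token_classes(record):
    """Map each whitespace-separated token to a coarse class (assumed scheme)."""
    classes = []
    for token in record.split():
        if token.isdigit():
            classes.append("NUM")
        elif token.isalpha():
            classes.append("ALPHA")
        else:
            classes.append("MIXED")
    return classes


def candidate_patterns(classes, min_len=2, max_len=3):
    """Yield every contiguous class sequence of length min_len..max_len."""
    for n in range(min_len, max_len + 1):
        for start in range(len(classes) - n + 1):
            yield tuple(classes[start:start + n])


def greedy_cover(records, max_patterns=10):
    """Repeatedly pick the pattern that covers the most still-uncovered records."""
    covers = defaultdict(set)  # pattern -> indices of records containing it
    for idx, record in enumerate(records):
        for pattern in candidate_patterns(token_classes(record)):
            covers[pattern].add(idx)

    covered = set()
    chosen = []
    while len(chosen) < max_patterns:
        # Sorting first makes ties resolve to the lexicographically smallest pattern.
        best = max(sorted(covers), key=lambda p: len(covers[p] - covered), default=None)
        if best is None:
            break
        gain = len(covers[best] - covered)
        if gain == 0:
            break  # every coverable record is already covered
        chosen.append((best, gain))
        covered |= covers[best]
    return chosen


# Hypothetical address-like records, invented for this example.
records = [
    "12 Main Street",
    "7 Oak Avenue",
    "Flat 3B Hill Road",
    "PO Box 42",
    "Suite 9",
]
selected = greedy_cover(records)
for pattern, gain in selected:
    print(" ".join(pattern), "newly covers", gain, "record(s)")
```

Each round scores a candidate by the number of records it would newly cover, so a pattern that overlaps heavily with earlier picks is deferred in favor of one that reaches unexplained data. In this sketch, experts would review the short ranked list rather than the raw records.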
