Intelligently creating and recommending reusable reformatting rules

When users combine data from multiple sources into a spreadsheet or dataset, the result is often a mishmash of formats, since phone numbers, dates, course numbers and other string-like kinds of data can each be written in many different ways. Although spreadsheets provide features for reformatting numbers and a few specific kinds of string data, they offer no support for the wide range of other kinds of string data that users encounter. We describe a user interface in which a user can specify the formats of each kind of data, and an algorithm that uses these formats to automatically generate reformatting rules that transform strings from one format to another. In effect, our system enables users to create a small expert system called a "tope" that can recognize and reformat instances of one kind of data. Later, as the user works with a spreadsheet, our system recommends appropriate topes for validating and reformatting the data. With a recall of over 80% at a query time of under 1 second, the recommendation algorithm is accurate enough and fast enough to make useful recommendations in an interactive setting. A laboratory experiment shows that, compared to manual typing, users can reformat sample spreadsheet data more than twice as fast by creating and using topes.
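
To make the idea concrete, the following is a minimal sketch of how a tope might recognize a string in one format, re-emit it in another, and be recommended for a column of cells. It is an illustration under stated assumptions, not the system's implementation: the names `Format`, `Tope`, `coverage` and `recommend` are hypothetical, regular expressions stand in for the user-specified format descriptions, and recommendation is approximated by picking the tope that matches the largest share of cells.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Format:
    """One way of writing a kind of data (hypothetical structure)."""
    name: str
    pattern: "re.Pattern[str]"  # recognizer with named groups for the parts
    template: str               # str.format template that re-emits the parts


@dataclass
class Tope:
    """A kind of data plus the formats it can appear in (illustrative only)."""
    name: str
    formats: List[Format]

    def parse(self, text: str) -> Optional[Dict[str, str]]:
        # Try each format in turn; return the named parts of the first match.
        for fmt in self.formats:
            match = fmt.pattern.fullmatch(text.strip())
            if match:
                return match.groupdict()
        return None

    def reformat(self, text: str, target: str) -> Optional[str]:
        # Recognize the input in any known format, then emit the target format.
        parts = self.parse(text)
        if parts is None:
            return None  # not an instance of this kind of data
        by_name = {fmt.name: fmt for fmt in self.formats}
        return by_name[target].template.format(**parts)

    def coverage(self, cells: List[str]) -> float:
        # Fraction of cells this tope recognizes (assumed ranking signal).
        if not cells:
            return 0.0
        return sum(self.parse(c) is not None for c in cells) / len(cells)


PHONE = Tope("phone number", [
    Format("dashed",
           re.compile(r"(?P<area>\d{3})-(?P<exch>\d{3})-(?P<line>\d{4})"),
           "{area}-{exch}-{line}"),
    Format("parens",
           re.compile(r"\((?P<area>\d{3})\) (?P<exch>\d{3})-(?P<line>\d{4})"),
           "({area}) {exch}-{line}"),
    Format("dotted",
           re.compile(r"(?P<area>\d{3})\.(?P<exch>\d{3})\.(?P<line>\d{4})"),
           "{area}.{exch}.{line}"),
])

ISO_DATE = Tope("date", [
    Format("iso",
           re.compile(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"),
           "{y}-{m}-{d}"),
    Format("us",
           re.compile(r"(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})"),
           "{m}/{d}/{y}"),
])


def recommend(topes: List[Tope], cells: List[str]) -> Optional[Tope]:
    # Suggest the tope that recognizes the most cells, if any tope matches.
    best = max(topes, key=lambda t: t.coverage(cells), default=None)
    if best is None or best.coverage(cells) == 0.0:
        return None
    return best


column = ["(412) 555-1234", "412.555.9876", "412-555-0000", "n/a"]
tope = recommend([PHONE, ISO_DATE], column)
cleaned = [tope.reformat(c, "dashed") or c for c in column]
```

In this sketch, a cell that matches no format (such as `n/a`) is left unchanged, which loosely mirrors using a tope for validation as well as reformatting.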
