Interactive Data Integration through Smart Copy & Paste

In many scenarios, such as emergency response or ad hoc collaboration, it is critical to reduce the overhead in integrating data. Here, the goal is often to rapidly integrate “enough” data to answer a specific question. Ideally, one could perform the entire process interactively under one unified interface: defining extractors and wrappers for sources, creating a mediated schema, and adding schema mappings — while seeing how these impact the integrated view of the data, and refining the design accordingly. We propose a novel smart copy and paste (SCP) model and architecture for seamlessly combining the design-time and run-time aspects of data integration, and we describe an initial prototype, the CopyCat system. In CopyCat, the user does not need special tools for the different stages of integration: instead, the system watches as the user copies data from applications (including the Web browser) and pastes them into CopyCat’s spreadsheet-like workspace. CopyCat generalizes these actions and presents proposed auto-completions, each with an explanation in the form of provenance. The user provides feedback on these suggestions — through either direct interactions or further copy-and-paste operations — and the system learns from this feedback. This paper provides an overview of our prototype system, and identifies key research challenges in achieving SCP in its full generality.
