Pay-as-you-go information integration in personal and social dataspaces

A personal and social dataspace is the set of all information pertaining to a given user. It includes a heterogeneous mix of files, folders, email, contacts, music, calendar items, and images, among others, distributed across data sources such as filesystems, email servers, network shares, databases, and web servers. In addition, it includes all connections this user has to other users in online services such as social networking web sites. Although a personal and social dataspace is richly heterogeneous and distributed, only very limited tools are available to help users manage their information and obtain a unified view over it.

At one extreme, search engines allow users to pose simple keyword and path searches over all of their data sources. These systems, however, return only ranked lists of best-effort results and provide limited or no means for users to increase the quality of query results over time. At the other extreme, systems built on top of classic database technology, such as traditional information-integration systems, provide precise query semantics for queries over a set of data sources. Although the quality of query results returned by these systems is high, they are typically restricted to a subset of a user's personal information, because integrating the data requires specifying complex schema mappings. As a consequence, these systems have limited coverage and provide equally limited support for non-expert users to refine their view of their personal information over time.

This thesis investigates a new breed of information-integration architecture that stands in between the two extremes of search engines and traditional information-integration systems. We term this new type of system a Personal Dataspace Management System (PDSMS). Like a search engine, a PDSMS provides a simple search service over all of the user's dataspace as soon as it is bootstrapped. In contrast to search engines, however, the PDSMS represents data not at the coarse-grained level of files (or text documents), but rather using a fine-grained graph-based data model. In addition, a PDSMS provides means for a user to increase the level of integration of her dataspace gradually, in a pay-as-you-go fashion. This is done by enabling users to provide simple integration "hints" that allow the PDSMS to improve the quality of query results. In contrast to traditional information-integration systems, however, at no point does the PDSMS require users to specify a global mediated schema for their information.

We make four main contributions to the design of PDSMSs. First, we propose the iMeMex Data Model (iDM), a simple yet powerful graph-based data model able to represent the heterogeneous data mix found in a personal and social dataspace. Our data model enables query capabilities on top of the user's dataspace not commonly found in state-of-the-art tools.
