3SEPIAS: A Semi-Structured Search Engine for Personal Information in dAtaspace System

Personal information is increasingly scattered across heterogeneous sources, which is a major obstacle to managing and retrieving it. To address this problem, this paper presents the blueprint of a novel Personal Information Management (PIM) system named 3SEPIAS (short for Semi-Structured Search Engine for Personal Information in dAtaspace System). 3SEPIAS has three main features: data integration without upfront semantic reconciliation, a flexible query model for data with sparse and evolving schemas, and an efficient best-effort proximity search approach on graphs. To this end, we first propose a semi-structured graph data model called the Interpreted Object Model (IOM), which uniformly represents a user's heterogeneous personal information and loosely integrates it into a dataspace in a schema-later way. A Semi-Structured Search Engine (3SE) can then be used to search over the personal dataspace. We propose an intuitive 3SE Query Language (3SQL) that lets users query with varying degrees of structural constraint, according to their knowledge of the underlying schemas. Moreover, we propose a best-effort top-k proximity search optimization strategy and corresponding graph index structures to improve the efficiency of query processing. We perform comprehensive experiments to test both the effectiveness and the efficiency of our proximity search approach. The results show that 3SE outperforms previous proximity search systems by a large margin, especially on large graphs, with little or no loss of result quality.
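To make the notion of top-k keyword proximity search on a graph concrete, the following is a minimal illustrative sketch in Python. It is not the 3SE algorithm, its best-effort optimization, or its index structures, none of which are specified in this abstract. It assumes an undirected graph stored as an adjacency dictionary, plain-text node labels, substring keyword matching, and a score equal to the sum of hop distances from a candidate node to the nearest match of each keyword. All names (`keyword_distances`, `top_k_proximity`, `graph`, `labels`) and the sample data are hypothetical.

```python
from collections import deque
import heapq


def keyword_distances(graph, labels, keyword):
    """Multi-source BFS: hop distance from each reachable node to the
    nearest node whose label contains the keyword (case-insensitive)."""
    sources = [n for n, text in labels.items() if keyword.lower() in text.lower()]
    dist = {n: 0 for n in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def top_k_proximity(graph, labels, keywords, k):
    """Return the k nodes with the smallest total distance to all keywords.

    Assumed scoring (not taken from the paper): sum of hop distances to the
    nearest match of each keyword; nodes that cannot reach some keyword
    are skipped."""
    per_keyword = [keyword_distances(graph, labels, kw) for kw in keywords]
    scored = [
        (sum(d[node] for d in per_keyword), node)
        for node in labels
        if all(node in d for d in per_keyword)
    ]
    return heapq.nsmallest(k, scored)


# Hypothetical personal-information graph: an email, its sender,
# an attached document, and the document's author.
labels = {
    "email1": "Email from Alice about budget",
    "alice": "Person Alice",
    "doc1": "Document budget report",
    "bob": "Person Bob",
}
graph = {
    "email1": ["alice", "doc1"],
    "alice": ["email1"],
    "doc1": ["email1", "bob"],
    "bob": ["doc1"],
}

print(top_k_proximity(graph, labels, ["alice", "budget"], k=2))
```

This exhaustive version runs one full breadth-first search per keyword and scores every node, which is the kind of cost that a best-effort strategy with graph indexes would aim to reduce on large graphs.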
