Keyword searches in data-centric XML documents using tree partitioning

Abstract This paper presents an effective keyword search method for data-centric extensible markup language (XML) documents. The method divides an XML document into compact, connected, integral subtrees, called self-integral trees (SI-Trees), to capture the structural information in the document. The SI-Trees are generated based on a schema guide. Meaningful self-integral trees (MSI-Trees), which contain all or some of the input keywords, are then identified. Indexing is used to accelerate the retrieval of the MSI-Trees related to the input keywords, and the MSI-Trees are ranked so that the top-k results with the highest ranks are returned. Extensive tests demonstrate that this method answers a keyword query in 10–100 ms and outperforms existing approaches by 1–2 orders of magnitude.
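The pipeline outlined in the abstract (partition, index, rank, return the top-k) can be illustrated with a minimal sketch. This is not the paper's algorithm. The `unit_tags` set is a hypothetical stand-in for the schema guide, the subtrees it selects only approximate SI-Trees, and the score (the number of distinct query keywords a subtree contains) is an assumed placeholder for the paper's ranking function. All function names are illustrative.

```python
import heapq
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def partition(root, unit_tags):
    """Collect subtrees rooted at elements whose tag is in unit_tags.

    unit_tags is a hypothetical stand-in for a schema guide.
    """
    return [el for el in root.iter() if el.tag in unit_tags]


def tokens(element):
    """Lower-cased word tokens found anywhere in the element's text."""
    text = " ".join(element.itertext())
    return set(re.findall(r"\w+", text.lower()))


def build_index(subtrees):
    """Inverted index: keyword -> set of subtree positions containing it."""
    index = defaultdict(set)
    for i, subtree in enumerate(subtrees):
        for tok in tokens(subtree):
            index[tok].add(i)
    return index


def search(index, keywords, k):
    """Return up to k (subtree position, score) pairs, best first.

    Assumed score: number of distinct query keywords in the subtree.
    Ties are broken in favour of the earlier subtree.
    """
    scores = defaultdict(int)
    for kw in set(keywords):
        for i in index.get(kw.lower(), ()):
            scores[i] += 1
    return heapq.nlargest(k, scores.items(), key=lambda p: (p[1], -p[0]))


DOC = """
<dblp>
  <paper><title>Keyword search in XML</title><author>Li</author></paper>
  <paper><title>Top-k ranking</title><author>Fagin</author></paper>
  <paper><title>XML indexing</title><author>Li</author></paper>
</dblp>
"""

root = ET.fromstring(DOC)
subtrees = partition(root, {"paper"})
index = build_index(subtrees)
results = search(index, ["xml", "li"], k=2)
for position, score in results:
    print(position, score, subtrees[position].findtext("title"))
```

Subtrees that contain only some of the keywords still receive a score, which mirrors the abstract's "all or some of the input keywords". A realistic ranking would also weigh structure and term statistics, which this sketch leaves out.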
