Distributed XML Twig Query Processing Using MapReduce

Twig query processing is one of the core operations in XML query evaluation. Centralized holistic twig algorithms suffer significant efficiency losses when large-scale XML documents are partitioned and stored in the cloud. Previous work on distributed twig query processing has several limitations, such as complete dependence on a priori knowledge of query patterns and the need to iterate MapReduce jobs. In this paper, our strategy for arbitrary XML partitioning and storage requires no knowledge of the query pattern, and twig queries can be processed efficiently in a single-round MapReduce job with good scalability. Extensive experiments verify the efficiency and scalability of our algorithms.
