Handling distributed XML queries over large XML data based on MapReduce framework

Abstract

With the growing number of available extensible markup language (XML) documents, numerous querying approaches have been proposed in the literature. XPath queries and twig pattern queries are the two basic query types, and they directly affect the efficiency of XML operations. Processing massive XML data in a distributed manner is challenging. This paper develops an efficient distributed XML query processing method based on MapReduce that processes several queries simultaneously over large volumes of XML data. First, we split a large-scale XML data file into file-splits and store them in a distributed storage system. Then, we present an efficient algorithm that uses the MapReduce framework to compute the different fragments of the document tree in parallel. To handle large amounts of XML data efficiently, we build a partition index and use a random access mechanism for specific queries. The experimental results show that the proposed approach is efficient and scales well.
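The pipeline outlined in the abstract (split the file, index the partitions, evaluate several queries per split in a map step, merge in a reduce step) can be illustrated with a small single-process sketch. This is not the paper's algorithm: the function names, the tag-set partition index, and the use of `xml.etree.ElementTree` path expressions in place of a real XPath or twig matcher are all assumptions made for illustration, and the MapReduce runtime is simulated with plain Python loops.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def make_splits(xml_text, records_per_split):
    """Cut the document into file-splits that each hold whole top-level records."""
    root = ET.fromstring(xml_text)
    records = [ET.tostring(child, encoding="unicode") for child in root]
    return ["<split>" + "".join(records[i:i + records_per_split]) + "</split>"
            for i in range(0, len(records), records_per_split)]

def build_partition_index(splits):
    """Hypothetical partition index: split id -> set of element tags it contains."""
    return {sid: {e.tag for e in ET.fromstring(text).iter()}
            for sid, text in enumerate(splits)}

def map_split(sid, text, queries):
    """Map step: evaluate every relevant query against one split."""
    fragment = ET.fromstring(text)
    for name, (path, _required_tags) in queries.items():
        for pos, node in enumerate(fragment.findall(path)):
            # Key by query name; keep (split id, position) to restore order later.
            yield name, (sid, pos, node.text)

def run_queries(splits, index, queries):
    """Simulated job: index-guided split selection, map, then reduce."""
    grouped = defaultdict(list)
    for sid, text in enumerate(splits):
        # Only pass queries whose required tags all occur in this split.
        relevant = {n: q for n, q in queries.items() if q[1] <= index[sid]}
        if not relevant:
            continue  # the split is never parsed for these queries
        for name, value in map_split(sid, text, relevant):
            grouped[name].append(value)
    # Reduce step: merge per-query results in document order.
    return {name: [t for _, _, t in sorted(vals)] for name, vals in grouped.items()}

DOC = """<lib>
<book><title>A</title><year>2016</year></book>
<book><title>B</title><year>2017</year></book>
<article><title>C</title></article>
<book><title>D</title><year>2016</year></book>
</lib>"""

# Each query: (path expression, tags that must be present in a split).
QUERIES = {
    "q1": ("book/title", {"book", "title"}),
    "q2": ("article/title", {"article", "title"}),
    "q3": ("book[year='2016']/title", {"book", "year", "title"}),
}

splits = make_splits(DOC, records_per_split=2)
index = build_partition_index(splits)
results = run_queries(splits, index, QUERIES)
```

Here `q2` is never evaluated on the first split because the index shows it contains no `article` element, which is the kind of skipping a partition index with random access is meant to enable. In a real deployment the splits would live in a distributed file system and the map and reduce steps would run as cluster tasks.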
