Query-Driven Frequent Co-Occurring Term Computation over Relational Data Using MapReduce

Given a keyword query q over large structured data, traditional keyword search may return a large number of relevant results, leaving users with the frustrating task of picking out the results that interest them. To help users understand the data being searched, we investigate the problem of computing frequent co-occurring terms (FCTs) in large relational data. By returning the terms that co-occur most frequently with the given keywords, we give users an overview of the relevant data. FCT computation is also a fundamental building block of data mining, because the discovered FCTs can be used to analyze the topics or contexts of user interest. Although the problem of FCT computation was proposed and investigated in Tao and Yu [(2009) Finding Frequent Co-Occurring Terms in Relational Keyword Search. 12th Int. Conf. Extending Database Technology EDBT, Saint-Petersburg, Russia, March 23–26, pp. 839–850. ACM, New York, USA], further work is needed to improve performance because FCT computation is very expensive. In particular, as data volumes grow, the centralized approach of Tao and Yu (2009) faces a serious efficiency challenge. To address this, we investigate how to perform parallel FCT computation using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. We design an effective mapping mechanism that exploits the approximately maximal workload of FCT computation to balance the computational cost across processors, while reducing the shuffling cost and avoiding data skew. Analytical and experimental evaluations on TPC-H benchmark datasets of different sizes demonstrate the efficiency and scalability of the proposed approach.
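
To make the notion of an FCT concrete, the following is a minimal single-machine sketch of counting co-occurring terms in a map/shuffle/reduce style. It is an illustration under simplifying assumptions, not the mapping mechanism proposed in the paper: each "row" is assumed to be a plain text string, a row is assumed to match when it contains every query keyword, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `top_fcts`) are hypothetical. The joins over relational tables and the workload balancing described above are not modeled.

```python
from itertools import groupby


def map_phase(rows, query):
    """Emit (term, 1) for each non-query term in every row containing all query keywords."""
    keywords = {w.lower() for w in query}
    for row in rows:
        terms = set(row.lower().split())
        if keywords <= terms:  # row matches the whole query
            for term in terms - keywords:
                yield term, 1


def shuffle(pairs):
    """Group emitted values by key, imitating the shuffle step on one machine."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Sum the partial counts of each term."""
    for term, counts in grouped:
        yield term, sum(counts)


def top_fcts(rows, query, k):
    """Return the k terms that co-occur most often with the query (ties broken alphabetically)."""
    counts = dict(reduce_phase(shuffle(map_phase(rows, query))))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    rows = [
        "red apple pie",
        "green apple pie",
        "red apple pie recipe",
        "apple juice",
    ]
    print(top_fcts(rows, ["apple", "pie"], k=2))
```

In this toy setting every matching row costs roughly the same to process; in a real cluster the amount of work per key can vary widely, which is why balancing reducers and limiting shuffled data matter.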
