Text Relevance Analysis Method over Large-Scale High-Dimensional Text Data Processing

As the volume of digital information grows rapidly in social, industrial, and scientific domains, MapReduce, a distributed computation framework, has become widely adopted for analytics on large-scale data. In recent years, approximation algorithms have also become an important way to solve large-scale data problems. For high-dimensional text data in particular, Semantic Web applications and search engines must support proximity search and text relevance analysis. The main difficulties of large-scale text processing are fast comparison and relevance judgment. In this paper, we propose an approximate search method based on approximate bit strings on the MapReduce platform. Experiments show excellent performance of the proposed algorithms in efficiency, effectiveness, and scalability.
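
To make the idea of bit-string comparison concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes a SimHash-style signature as a stand-in for the approximate bit string and simulates the map and reduce phases in plain Python. The names `signature`, `map_phase`, `reduce_phase`, and `run`, the 64-bit length, the 4 bands, and the distance threshold of 3 are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BITS = 64  # signature length (illustrative assumption)


def signature(text: str, bits: int = BITS) -> int:
    """SimHash-style bit string: every token votes on every bit position."""
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when the majority of tokens voted for it.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def map_phase(doc_id, text, bands=4):
    """Emit one (band key, (doc_id, signature)) pair per band."""
    sig = signature(text)
    width = BITS // bands
    mask = (1 << width) - 1
    for b in range(bands):
        yield (b, (sig >> (b * width)) & mask), (doc_id, sig)


def reduce_phase(values, max_dist=3):
    """Compare only documents that share a band key."""
    for (id_a, sig_a), (id_b, sig_b) in combinations(sorted(values), 2):
        d = hamming(sig_a, sig_b)
        if d <= max_dist:
            yield (id_a, id_b, d)


def run(docs, bands=4, max_dist=3):
    """Single-process simulation of the shuffle between map and reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text, bands):
            groups[key].append(value)
    results = set()
    for values in groups.values():
        results.update(reduce_phase(values, max_dist))
    return sorted(results)
```

With 4 bands and a threshold of 3, any two signatures that differ in at most 3 bits must agree exactly on at least one band, so the banding step does not miss pairs within the threshold. It only avoids comparing pairs that share no band.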
