Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce

With the continuous development of Web technology, many Internet problems have evolved into Big Data problems characterized by volume, variety, velocity, and variability. Among them, organizing the vast number of web pages and retrieving the information users need is a critical challenge. Document classification is an important approach to this problem, and nearest neighbor query is its key subproblem. Most parallel nearest neighbor query methods compute the Cartesian product of the training set and the testing set, which results in poor time efficiency. In this paper, two methods for document nearest neighbor query based on pairwise similarity are proposed: brute-force and pre-filtering. Brute-force consists of two phases, copying and filtering, carried out in a single MapReduce job. To obtain the nearest neighbors of each document, every document pair is copied twice and all generated records are shuffled; however, the time efficiency of the shuffle is sensitive to the number of intermediate results. To reduce the intermediate results, pre-filtering is proposed for nearest neighbor query based on pairwise similarity: since only the top-k neighbors of each document are emitted, the volume of shuffled records stays within the same order of magnitude as the input size. A detailed theoretical analysis is also provided, and the performance of both algorithms is demonstrated by experiments on a real-world dataset.
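The two pipelines described in the abstract can be illustrated with a minimal, single-process sketch in Python that mimics the map, shuffle, and reduce steps over pairwise-similarity records. This is an illustrative sketch under stated assumptions, not the paper's implementation: the toy PAIRS data, the value of K, and the names brute_force_map, prefilter_map, and shuffle_and_reduce are hypothetical, and a real deployment would run these steps as MapReduce jobs over the output of a pairwise document-similarity computation.

```python
import heapq
from collections import defaultdict

# Toy pairwise-similarity records (doc_a, doc_b, similarity).
# Illustrative values only, not from the paper's dataset.
PAIRS = [
    ("d1", "d2", 0.90), ("d1", "d3", 0.40), ("d1", "d4", 0.75),
    ("d2", "d3", 0.55), ("d2", "d4", 0.10), ("d3", "d4", 0.80),
]

K = 2  # number of nearest neighbors kept per document (assumed)


def brute_force_map(pairs):
    """Brute-force mapper: every pair is copied twice, once per endpoint,
    so the shuffle carries two records for each input pair."""
    for a, b, sim in pairs:
        yield a, (sim, b)
        yield b, (sim, a)


def prefilter_map(pairs, k=K):
    """Pre-filtering mapper: a bounded min-heap per document keeps only the
    locally best k neighbors, so shuffle volume stays O(k * #documents)."""
    local_topk = defaultdict(list)  # doc -> min-heap of (sim, neighbor)
    for a, b, sim in pairs:
        for doc, other in ((a, b), (b, a)):
            heap = local_topk[doc]
            if len(heap) < k:
                heapq.heappush(heap, (sim, other))
            else:
                heapq.heappushpop(heap, (sim, other))
    for doc, heap in local_topk.items():
        for sim, other in heap:
            yield doc, (sim, other)


def shuffle_and_reduce(records, k=K):
    """Group shuffled records by document key, then keep the top-k per key."""
    grouped = defaultdict(list)
    for doc, value in records:
        grouped[doc].append(value)
    return {doc: heapq.nlargest(k, values) for doc, values in grouped.items()}


if __name__ == "__main__":
    # Both variants yield the same k nearest neighbors per document;
    # they differ in how many records reach the shuffle.
    print(shuffle_and_reduce(brute_force_map(PAIRS)))
    print(shuffle_and_reduce(prefilter_map(PAIRS)))
```

Running the sketch prints identical top-K neighbor lists for both variants; the quantity the paper's analysis targets is the shuffle volume, which is twice the number of pairs for brute-force but bounded by K times the number of documents per mapper for pre-filtering.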
