A Layered Locality Sensitive Hashing based Sequence Similarity Search Algorithm for Web Sessions

In this article we propose a Layered Locality Sensitive Hashing algorithm to perform similarity search on web log sequence data. Locality Sensitive Hashing is an efficient technique for approximate nearest neighbor search over a large database, because its dependence on the data size is sub-linear even in high dimensions. Mining large web logs to provide customised services to users is one area where similar sessions must be extracted quickly. The variety of session lengths adds extra complexity to this problem. To tackle this dimension variability, the concept of layering is introduced into locality sensitive hashing, and a recently proposed web page similarity measure, Psim, is used. The proposed method is referred to as the Layered Locality Sensitive Hashing based Sequence Similarity Search Algorithm, or LaLS³A for short. The similarity at the session level is computed using a fast sequence alignment technique, FOGSAA. LaLS³A achieves an average time gain of 81.88% with 97.2% accurate results when compared to the exact algorithm on the NASA and ClarkNet web log datasets. LaLS³A is therefore a time-efficient solution for similarity search between variable-length sequences, with outputs that are almost as good as the exact ones.
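
The abstract does not spell out how the layers are built, so the following is only a minimal sketch of one way layering could handle variable-length sessions, not the authors' implementation. It assumes that sessions are non-empty lists of page URLs, that a layer is chosen by session length (`layer_width` is a hypothetical parameter), and that MinHash over the set of visited pages stands in for the hash family. The class name `LayeredLSH` and all of its methods are illustrative. Psim and FOGSAA are not implemented here; in a full pipeline the returned candidates would still have to be scored by a sequence alignment step.

```python
import hashlib
from collections import defaultdict


def minhash_signature(pages, num_hashes=8):
    """MinHash signature over the set of pages in a (non-empty) session."""
    unique_pages = set(pages)
    return tuple(
        min(hashlib.md5(f"{i}:{p}".encode()).hexdigest() for p in unique_pages)
        for i in range(num_hashes)
    )


class LayeredLSH:
    """Illustrative index: one set of hash tables per session-length layer."""

    def __init__(self, layer_width=4, num_hashes=8, band_size=2):
        self.layer_width = layer_width  # assumed: sessions grouped by length
        self.num_hashes = num_hashes
        self.band_size = band_size
        # layer id -> band key -> set of session ids
        self.tables = defaultdict(lambda: defaultdict(set))

    def _layer(self, session):
        return len(session) // self.layer_width

    def _band_keys(self, session):
        sig = minhash_signature(session, self.num_hashes)
        for start in range(0, self.num_hashes, self.band_size):
            yield (start, sig[start:start + self.band_size])

    def add(self, session_id, session):
        layer = self._layer(session)
        for key in self._band_keys(session):
            self.tables[layer][key].add(session_id)

    def candidates(self, query):
        """Return ids sharing a bucket with the query in nearby layers."""
        layer = self._layer(query)
        found = set()
        for neighbour in (layer - 1, layer, layer + 1):
            table = self.tables.get(neighbour)
            if table is None:
                continue
            for key in self._band_keys(query):
                found |= table.get(key, set())
        return found


index = LayeredLSH()
index.add("short", ["/", "/news", "/sports", "/contact"])
index.add("long", ["/", "/news"] + [f"/page{i}" for i in range(18)])
hits = index.candidates(["/", "/news", "/sports", "/contact"])
```

Here the query lands in the same layer as `short` and produces an identical signature, so `short` is returned, while `long` sits in a distant layer that is never probed. Probing the neighbouring layers is a design choice of this sketch, meant to avoid missing sessions whose lengths fall just across a layer boundary.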
