Research on Web Page Classification Method Based on Query Log

Web page classification underpins many Internet information retrieval applications, such as directory classification and vertical search. Query-log-based methods are a lightweight form of Web page classification: they avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes it difficult to build a classifier from the logs directly. To address this problem, we use word embeddings to capture the semantic relations among queries and propose three improved graph-structure classification algorithms. First, to reflect the semantic relevance between queries, we map each user query into a low-dimensional space as a query vector. Then, we compute a vector for each uniform resource locator (URL) from the relationship between queries and URLs. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods improve the F1-value by about 20% over other query-log-based Web page classification methods.
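The three steps can be illustrated with a minimal sketch. It is not the paper's implementation, and everything specific in it is an assumption made for illustration: the toy two-dimensional word vectors, averaging word vectors to form a query vector, the click-weighted average used for URL vectors, the cosine-similarity graph, and all function names and `.example` URLs. It uses plain label propagation with clamped seed labels, not the improved LPA, and it omits the bipartite graph expansion algorithm.

```python
import math

def query_vector(query, word_vecs):
    """Average the embeddings of the query's known words (assumed scheme)."""
    vecs = [word_vecs[w] for w in query.split() if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def url_vectors(clicks, q_vecs):
    """Click-weighted average of query vectors per URL.

    clicks maps (query, url) -> click count.
    """
    sums, totals = {}, {}
    for (q, u), count in clicks.items():
        acc = sums.setdefault(u, [0.0] * len(q_vecs[q]))
        for i, x in enumerate(q_vecs[q]):
            acc[i] += count * x
        totals[u] = totals.get(u, 0) + count
    return {u: [x / totals[u] for x in acc] for u, acc in sums.items()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def propagate(u_vecs, seeds, iters=20):
    """Plain label propagation over a cosine-similarity graph of URLs.

    Seed URLs keep their labels; every other URL takes the
    similarity-weighted, normalized label scores of all other URLs.
    """
    labels = sorted(set(seeds.values()))
    urls = list(u_vecs)
    scores = {u: {l: 1.0 if seeds.get(u) == l else 0.0 for l in labels}
              for u in urls}
    for _ in range(iters):
        new = {}
        for u in urls:
            if u in seeds:
                new[u] = scores[u]  # clamp labeled pages
                continue
            acc = {l: 0.0 for l in labels}
            for v in urls:
                if v == u:
                    continue
                w = max(cosine(u_vecs[u], u_vecs[v]), 0.0)
                for l in labels:
                    acc[l] += w * scores[v][l]
            z = sum(acc.values())
            new[u] = {l: (s / z if z else 0.0) for l, s in acc.items()}
        scores = new
    return {u: max(scores[u], key=scores[u].get) for u in urls}

# Toy data: hypothetical embeddings, click log, and seed labels.
word_vecs = {
    "cheap": [1.0, 0.0], "flights": [0.9, 0.1], "hotel": [0.8, 0.2],
    "python": [0.0, 1.0], "tutorial": [0.1, 0.9], "code": [0.0, 1.0],
}
clicks = {
    ("cheap flights", "travel-a.example"): 3,
    ("cheap hotel", "travel-b.example"): 2,
    ("python tutorial", "dev-a.example"): 4,
    ("python code", "dev-b.example"): 1,
}
seeds = {"travel-a.example": "travel", "dev-a.example": "tech"}

q_vecs = {q: query_vector(q, word_vecs) for q, _ in clicks}
u_vecs = url_vectors(clicks, q_vecs)
predicted = propagate(u_vecs, seeds)
```

In this toy log no two URLs share a query, so a click graph alone would give the unlabeled pages no path to a labeled one. The embedding-based similarity is what connects `travel-b.example` to the travel seed and `dev-b.example` to the tech seed, which is the sparsity problem the abstract describes.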
