Data cleansing for Web information retrieval using query independent features

We report on a study undertaken to better understand which kinds of Web pages are most useful to Web search engine users by exploiting query-independent features of retrieval target pages. To our knowledge, there has been little research on query-independent Web page cleansing for Web information retrieval. Based on more than 30 million Web pages obtained from both TREC and the widely used Chinese search engine SOGOU (www.sogou.com), we analyze the differences between retrieval target pages and ordinary ones. We also propose a learning-based data cleansing algorithm for filtering out Web pages that are unlikely to be useful for user requests. The results show that retrieval target pages can be separated from low-quality pages using query-independent features and cleansing algorithms. Our algorithm removes 95% of Web pages while losing less than 8% of retrieval target pages. This makes it possible for Web IR tools to meet over 92% of users' needs with only 5% of the pages on the Web.
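The two headline figures (share of pages removed, share of retrieval target pages lost) can be illustrated with a minimal sketch. The function name `cleansing_tradeoff` and the toy page IDs below are assumptions made for illustration only; they are not the authors' algorithm or data, and the sketch only measures a filter's output rather than learning one.

```python
def cleansing_tradeoff(all_pages, kept_pages, target_pages):
    """Return (reduction, target_loss) for a page filter.

    reduction   = fraction of the collection removed by the filter
    target_loss = fraction of retrieval target pages removed by the filter
    All names here are hypothetical, chosen for this illustration.
    """
    collection = set(all_pages)
    kept = set(kept_pages) & collection
    targets = set(target_pages) & collection
    if not collection or not targets:
        raise ValueError("collection and target set must be non-empty")
    reduction = 1.0 - len(kept) / len(collection)
    target_loss = 1.0 - len(kept & targets) / len(targets)
    return reduction, target_loss


# Toy data (invented): 1000 pages, 50 of which are retrieval targets.
pages = range(1000)
targets = range(50)
# A hypothetical filter keeps 50 pages: 46 targets and 4 non-targets.
kept = range(4, 54)

reduction, target_loss = cleansing_tradeoff(pages, kept, targets)
print(f"pages removed: {reduction:.0%}, target pages lost: {target_loss:.0%}")
```

In this toy setting the filter keeps 5% of the collection while retaining 46 of the 50 target pages, which mirrors the kind of trade-off the abstract reports.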
