Data cleansing for Web information retrieval using query independent features

Understanding which kinds of Web pages are most useful to Web search engine users is a critical task in Web information retrieval (IR). Most previous work has used hyperlink analysis algorithms to address this problem, whereas little research has focused on query-independent Web data cleansing for Web IR. In this paper, we first analyze the differences between retrieval target pages and ordinary pages, based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We then propose a learning-based data cleansing algorithm for removing Web pages that are unlikely to be useful for user requests. We found that both the English and the Chinese Web page corpora contain a large proportion of low-quality Web pages, and that retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in removing a large portion of Web pages with only a small loss of retrieval target pages. This makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of the pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance. © 2007 Wiley Periodicals, Inc.
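The abstract does not specify the learning algorithm or the feature set, so the following is only a minimal sketch of what query-independent cleansing could look like, not the authors' method. It assumes a naive-Bayes-style likelihood ratio over binary features; the feature names (`many_inlinks`, `short_url`), the smoothing constant, and the threshold are all hypothetical. The `evaluate` helper reports the two quantities the abstract trades off: how much of the corpus is removed and how many retrieval target pages survive.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

# A page is described only by query-independent binary features.
# The feature names used with this sketch are hypothetical.
Features = Dict[str, bool]
Weights = Dict[Tuple[str, bool], float]


def learn_weights(corpus: Sequence[Features],
                  targets: Sequence[Features],
                  alpha: float = 1.0) -> Weights:
    """Estimate log P(value | target page) / P(value | any page) per feature.

    Add-alpha smoothing keeps every ratio finite. This naive-Bayes-style
    estimate is an assumption, not the algorithm from the paper.
    """
    weights: Weights = {}
    for name in sorted(corpus[0]):
        for value in (True, False):
            p_target = (sum(p[name] == value for p in targets) + alpha) / (
                len(targets) + 2 * alpha)
            p_corpus = (sum(p[name] == value for p in corpus) + alpha) / (
                len(corpus) + 2 * alpha)
            weights[(name, value)] = math.log(p_target / p_corpus)
    return weights


def score(page: Features, weights: Weights) -> float:
    """Sum the per-feature log ratios, treating features as independent."""
    return sum(weights[(name, value)] for name, value in page.items())


def cleanse(corpus: Sequence[Features], weights: Weights,
            threshold: float = 0.0) -> List[int]:
    """Return the indexes of the pages kept after cleansing."""
    return [i for i, page in enumerate(corpus)
            if score(page, weights) >= threshold]


def evaluate(kept: Sequence[int], target_ids: Set[int],
             corpus_size: int) -> Tuple[float, float]:
    """Return (fraction of corpus removed, fraction of targets retained)."""
    reduction = 1.0 - len(kept) / corpus_size
    retention = len(target_ids & set(kept)) / len(target_ids)
    return reduction, retention
```

In use, `learn_weights` would be fitted on a sample of known retrieval target pages against the whole corpus, `cleanse` would be applied once at indexing time (no query is needed), and `evaluate` would be checked against held-out relevance judgments to choose the threshold.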
