A Survey of Web Page Cleaning Research

The rapid development of the Internet has given rise to a wide variety of Web applications and Web data, which have become a major source of data for much research. A Web page contains many kinds of content, such as advertising, navigation bars, related links, and text. However, not all of this content is needed for a given study or application; on the contrary, unrelated content reduces the effectiveness and efficiency of that research or application. Web page cleaning has therefore become a prominent topic in information retrieval as search engines have flourished, and it is necessary to survey the field of Web page denoising in order to support more in-depth study. First, this paper gives a brief introduction to the necessity of Web page cleaning and its related concepts. Second, the authors present a classification hierarchy of Web page cleaning methods, comprising single-model-based and multi-model-based methods. The paper then summarizes the main Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE. Third, it describes the experimental datasets and experimental methods used to evaluate these techniques. Finally, it discusses the existing problems and future directions in the Web page cleaning field.