Automatic Template Extraction from Heterogeneous Web Pages

The World Wide Web is the most useful source of information. To achieve high publishing productivity, many websites populate their pages automatically by filling common templates with content. Templates give readers easy access to the content through consistent structure. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents generated from heterogeneous templates. We cluster the web documents by the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure for clustering, together with a fast approximation of it, and provide a comprehensive analysis of our algorithm. Experimental results on real-life data sets confirm the effectiveness and robustness of our algorithm compared with state-of-the-art template detection algorithms.
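The idea of grouping pages by template structure and reading off one template per group can be illustrated with a minimal sketch. It is not the algorithm of this paper: the abstract does not specify the goodness measure or its approximation, so the sketch assumes a plain Jaccard similarity over root-to-node tag paths, a greedy single-pass clustering, and an arbitrary threshold of 0.6. The names `tag_paths`, `jaccard`, and `cluster_by_structure` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint of a page as a frozenset of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for a goodness measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_structure(docs, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose template is similar enough.

    The template of a cluster is the set of paths shared by all its members,
    so it is refined as each new member arrives.
    """
    clusters = []
    for index, html in enumerate(docs):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                cluster["template"] &= paths
                break
        else:
            clusters.append({"members": [index], "template": set(paths)})
    return clusters


docs = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>2</td><td>3</td></tr></table></body></html>",
]
clusters = cluster_by_structure(docs)
print([c["members"] for c in clusters])  # [[0, 1], [2, 3]]
```

The two `div`-based pages and the two `table`-based pages end up in separate clusters, and each cluster's `template` holds the tag paths common to its members. A greedy pass with a fixed threshold depends on document order and is chosen here only for brevity.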
