Boilerplate Detection and Recoding

Many information access applications must handle natural language texts that contain a large proportion of repeated, mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this phenomenon. We propose a method that detects such motifs automatically and without supervision, and enriches the document representation with specific features for them. We show experimentally that this document recoding strategy leads to improved classification on different collections.
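
The abstract does not specify the detection algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a motif can be approximated by a fixed-length word n-gram that recurs in several documents, and that recoding means adding one extra feature per motif to a bag-of-words representation. The function names (`detect_motifs`, `recode`), the parameters `n` and `min_docs`, and the `MOTIF:` feature prefix are all hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def detect_motifs(docs, n=5, min_docs=2):
    """Unsupervised detection (assumed proxy): keep every word n-gram
    that occurs in at least `min_docs` different documents."""
    doc_freq = Counter()
    for text in docs:
        # Use a set so each document counts an n-gram at most once.
        doc_freq.update(set(word_ngrams(text.lower().split(), n)))
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def recode(text, motifs, n=5):
    """Bag of words enriched with one extra feature per motif occurrence."""
    tokens = text.lower().split()
    features = Counter(tokens)
    for gram in word_ngrams(tokens, n):
        if gram in motifs:
            features["MOTIF:" + " ".join(gram)] += 1
    return features


if __name__ == "__main__":
    docs = [
        "this email is confidential and intended only for you. meeting at noon",
        "budget attached. this email is confidential and intended only for you.",
    ]
    motifs = detect_motifs(docs)
    for doc in docs:
        print(recode(doc, motifs))
```

A fixed n-gram length is a deliberate simplification: motifs that span one or more sentences would call for variable-length repeats, for instance maximal repeats found with a suffix tree or suffix array, which this sketch does not attempt.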
