Paper information - Boilerplate Detection and Recoding

Boilerplate Detection and Recoding

Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this phenomenon. We propose a method that detects such motifs automatically and without supervision, and enriches the document representation with specific features for them. We show experimentally that this document recoding strategy leads to improved classification on different collections.
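
The idea of detecting repeated motifs and recoding documents with extra features can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the detection procedure, so fixed-length word n-grams shared across documents stand in for the motifs here. The function names `find_repeated_ngrams` and `recode`, the `MOTIF::` feature prefix, the parameters `n` and `min_docs`, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def find_repeated_ngrams(docs, n=5, min_docs=2):
    """Return word n-grams that occur in at least `min_docs` documents.

    Assumption: fixed-length n-grams approximate boilerplate motifs;
    the paper's actual detection method is not described in the abstract.
    """
    doc_ids_by_gram = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            doc_ids_by_gram[tuple(tokens[i:i + n])].add(doc_id)
    return {gram for gram, ids in doc_ids_by_gram.items() if len(ids) >= min_docs}


def recode(text, motifs, n=5):
    """Return the document's word tokens plus one feature per motif occurrence."""
    tokens = text.lower().split()
    features = list(tokens)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in motifs:
            # Hypothetical feature naming scheme for the enriched representation.
            features.append("MOTIF::" + "_".join(gram))
    return features


# Two made-up documents sharing a signature-like opening sentence.
docs = [
    "this message is confidential and intended only for the recipient please review the budget",
    "this message is confidential and intended only for the recipient the meeting moved to friday",
]
motifs = find_repeated_ngrams(docs, n=5)
features = recode(docs[0], motifs, n=5)
```

The resulting feature list could be passed to any bag-of-features classifier. A real system would likely need variable-length repeats rather than a fixed `n`, which is the main simplification in this sketch.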

Jean-Michel Renders | Matthias Gallé
