Paper information - Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Abstract: Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95% and 99% correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The single features used for the classification are evaluated with regard to the merit they contribute to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
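To make the general idea of MLP-based boilerplate detection concrete, the following is a minimal sketch, not the texrex implementation. It assumes four hypothetical per-block features (capped length, link share, uppercase share, punctuation density), a hand-written one-hidden-layer network named `TinyMLP`, and a handful of invented toy blocks; none of these names, features, or data come from the paper.

```python
import math
import random


def block_features(text, link_chars):
    """Hypothetical per-block features, each scaled to [0, 1]."""
    n = max(len(text), 1)
    letters = sum(c.isalpha() for c in text)
    upper = sum(c.isupper() for c in text)
    punct = sum(c in ".,;:!?" for c in text)
    return [
        min(len(text) / 500.0, 1.0),  # block length, capped at 500 characters
        min(link_chars / n, 1.0),     # share of characters inside links
        upper / max(letters, 1),      # share of uppercase letters
        punct / n,                    # punctuation density
    ]


class TinyMLP:
    """One tanh hidden layer, one sigmoid output unit (1 = boilerplate)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row stores its bias as the last element.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.w2[-1]
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        return self._forward(x)[1]

    def train_step(self, x, y, lr):
        """One stochastic gradient step on the cross-entropy loss."""
        hidden, p = self._forward(x)
        delta_out = p - y
        # Hidden-layer deltas are computed before the output weights change.
        delta_hidden = [delta_out * self.w2[j] * (1.0 - h * h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w2[j] -= lr * delta_out * h
        self.w2[-1] -= lr * delta_out
        for row, d in zip(self.w1, delta_hidden):
            for i, xi in enumerate(x):
                row[i] -= lr * d * xi
            row[-1] -= lr * d


def log_loss(model, data):
    """Mean cross-entropy of the model on (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = model.predict_proba(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


# Invented toy blocks: (text, characters inside links, label).
TOY_BLOCKS = [
    ("Home | About | Contact | Login", 21, 1),
    ("Privacy Policy | Terms of Use | Sitemap", 33, 1),
    ("NEXT PAGE >>", 12, 1),
    ("The committee met on Tuesday to discuss the new budget, and "
     "several members raised concerns about the timeline.", 0, 0),
    ("Removing menus and footers before indexing keeps search results "
     "focused on the actual article text.", 0, 0),
    ("She argued that the sample was too small; the reviewers agreed, "
     "but asked for further analysis.", 12, 0),
]


def train_toy_model(epochs=500, lr=0.5):
    """Fit TinyMLP on the toy blocks; return model and loss before/after."""
    data = [(block_features(t, links), y) for t, links, y in TOY_BLOCKS]
    model = TinyMLP(n_in=4, n_hidden=3)
    loss_before = log_loss(model, data)
    for _ in range(epochs):
        for x, y in data:
            model.train_step(x, y, lr)
    return model, loss_before, log_loss(model, data)
```

Calling `train_toy_model()` returns the fitted model together with the loss before and after training; `model.predict_proba(block_features(text, link_chars))` then gives a boilerplate probability for a new block. A real system would use a much larger labeled sample per language and a richer feature set than this toy setup.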

Roland Schäfer
