Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Abstract: Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95 % and 99 % correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The individual features used for the classification are evaluated with regard to how much each contributes to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
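To make the classification step concrete, the following is a minimal sketch of how an MLP could score a single text block as boilerplate or content. It is not the texrex implementation: the three features (`length`, `link_density`, `ends_sentence`), the network size, and all weights are hypothetical and hand-set for illustration only. In a supervised setting such as the one described in the abstract, the weights would be learned from annotated blocks.

```python
import math


def block_features(text, link_chars):
    """Map one text block to three illustrative features in [0, 1].

    These features are assumptions made for this sketch, not the
    feature set evaluated in the paper.
    """
    n = len(text)
    if n == 0:
        return [0.0, 0.0, 0.0]
    length = min(n / 500.0, 1.0)             # clipped block length
    link_density = min(link_chars / n, 1.0)  # share of characters inside links
    ends_sentence = 1.0 if text.rstrip().endswith((".", "!", "?")) else 0.0
    return [length, link_density, ends_sentence]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hand-set placeholder weights (hypothetical; a real system learns them).
W_HIDDEN = [[-3.0, 6.0, -2.0],   # hidden unit 1: responds to link-heavy blocks
            [4.0, -4.0, 3.0]]    # hidden unit 2: responds to running text
B_HIDDEN = [-1.0, -1.0]
W_OUT = [4.0, -4.0]
B_OUT = 0.0


def boilerplate_score(features):
    """Forward pass of a 3-2-1 MLP; values near 1 suggest boilerplate."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


# A navigation menu (21 of its 30 characters are link text) and a sentence.
menu = block_features("Home | About | Contact | Login", link_chars=21)
paragraph = block_features(
    "Removal of boilerplate is one of the essential tasks in web corpus "
    "construction and web indexing.",
    link_chars=0,
)
```

With these placeholder weights, `boilerplate_score(menu)` lies above 0.5 and `boilerplate_score(paragraph)` lies below it. A threshold of 0.5 is itself an assumption; a corpus builder might prefer a different cut-off depending on whether precision or recall matters more.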
