Accurate and efficient general-purpose boilerplate detection for crawled web corpora
[1] Roland Schäfer,et al. Building Large Corpora from the Web Using a New Efficient Tool Chain , 2012, LREC.
[2] Jan Pomikálek. Removing Boilerplate and Duplicate Content from Web Corpora , 2011 .
[3] Adam Kilgarriff,et al. Cleaneval: a Competition for Cleaning Web Pages , 2008, LREC.
[4] Chih-Jen Lin,et al. LIBSVM: A library for support vector machines , 2011, TIST.
[5] Roland Schäfer,et al. CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws , 2016, LREC.
[6] Stefan Evert,et al. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium , 2011 .
[7] Miroslav Spousta,et al. Victor: the Web-Page Cleaning Tool , 2008 .
[8] Cédrick Fairon,et al. Building and Exploring Web Corpora. Proceedings of the 3rd web as corpus workshop, incorporating cleaneval , 2007 .
[9] S. Grossberg. Contour Enhancement, Short Term Memory, and Constancies in Reverberating Neural Networks , 1973 .
[10] Rudolf Mathar,et al. Enhanced Web Page Cleaning for Constructing Social Media Text Corpora , 2015 .
[11] Andrei Z. Broder,et al. Identifying and Filtering Near-Duplicate Documents , 2000, CPM.
[12] Pavel Pecina,et al. Web Page Cleaning with Conditional Random Fields , 2007 .
[13] Dan Roth,et al. Extracting article text from the web with maximum subsequence segmentation , 2009, WWW '09.
[14] Paulo Cortez. Data Mining with Multilayer Perceptrons and Support Vector Machines , 2012 .
[15] Silvia Bernardini,et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora , 2009, Lang. Resour. Evaluation.
[16] Barry Smyth,et al. Fact or Fiction: Content Classification for Digital Libraries , 2001, DELOS.
[17] Roland Schäfer,et al. Processing and querying large web corpora with the COW14 architecture , 2015 .
[18] Jean-Michel Renders,et al. Boilerplate Detection and Recoding , 2014, ECIR.
[19] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.
[20] Roland Schäfer,et al. Web Corpus Construction , 2013, Web Corpus Construction.
[21] Martin Schmidt,et al. FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück , 2007 .
[22] อนิรุธ สืบสิงห์,et al. Data Mining: Practical Machine Learning Tools and Techniques , 2014 .
[23] Peter Fankhauser,et al. Boilerplate detection using shallow text features , 2010, WSDM '10.
[24] Adam Kilgarriff,et al. Scaling to Billion-plus Word Corpora , 2009 .
[25] L. Buydens,et al. Facilitating the application of Support Vector Regression by using a universal Pearson VII function based kernel , 2006 .