Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

The speed at which information is published on the WWW is unprecedented. Individuals and organizations struggle to stay up to date and to find relevant knowledge in a tsunami of news, videos, posts, and comments. On the other hand, this content (mostly bound to HTML pages) is unstructured and not explicitly classified. In this context, machine-learning techniques can be very handy for automatically separating useful information from irrelevant noise. The present paper describes a novel approach to Web page crawling. The Smart Crawler employs two techniques to improve information classification: massive Web page crawling and continuous classification through committee machines. These ideas are implemented using Big Data and cloud-ready technologies, whose cornerstone is a framework that enables memory-intensive processing, high scalability, and streaming processing. The results indicate a significant classification capability and that the classification rate can scale linearly with the size of the dataset.
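To make the idea of continuous classification through a committee machine concrete, the following is a minimal sketch in Python, not the implementation described in the paper. It assumes a committee of three toy keyword-based members that each vote on a page, with the majority label winning; the names `keyword_member`, `committee_vote`, `classify_stream`, the keyword lists, and the labels `relevant`/`irrelevant` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# A committee member maps page text to a class label (assumed interface).
Classifier = Callable[[str], str]


def keyword_member(keywords: List[str], label: str,
                   default: str = "irrelevant") -> Classifier:
    """Hypothetical member: votes `label` if any keyword occurs in the text."""
    lowered = [k.lower() for k in keywords]

    def classify(text: str) -> str:
        t = text.lower()
        return label if any(k in t for k in lowered) else default

    return classify


def committee_vote(members: List[Classifier], text: str) -> str:
    """Combine the members' votes by simple majority."""
    votes = Counter(member(text) for member in members)
    return votes.most_common(1)[0][0]


def classify_stream(members: List[Classifier],
                    pages: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Label pages one at a time as they arrive from the crawler."""
    for page in pages:
        yield page, committee_vote(members, page)


# An odd number of members avoids ties when there are two labels.
members = [
    keyword_member(["spark", "cluster"], "relevant"),
    keyword_member(["crawler"], "relevant"),
    keyword_member(["classification", "machine learning"], "relevant"),
]

pages = [
    "A crawler running on a Spark cluster",
    "Celebrity gossip of the week",
]

labels = [label for _, label in classify_stream(members, pages)]
```

In a real setting the members would be trained classifiers rather than keyword rules, and the stream would be fed by the crawler; the generator form is used here only to suggest that pages are classified as they arrive rather than in a single batch.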
