Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques

Web scraping is a technique for automatically extracting information from web documents. It retrieves content related to a query, aggregates it, and transforms the data from an unstructured format into a structured representation. Text classification becomes a vital phase in summarizing the data and categorizing the web pages adequately. In this article, using effective web scraping methodologies, the data is first extracted from websites and then transformed into a structured form. Based on the keywords from the data, the documents are classified and labeled. A recursive feature elimination technique is applied to the data to select the best candidate feature subset. The final dataset is used to train standard machine learning algorithms. The proposed model performs well on classifying the documents from the extracted data with a better accuracy rate.

Keywords: Back-Propagation Neural Networks, Content Retrieval, Machine Learning, Recursive Feature Elimination, Text Classification, Web Harvesting, Web Scraping
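
The first stages described in the abstract (extract text from a page, put it into a structured form, and label the document from its keywords) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TextExtractor`, the functions `to_record` and `label_record`, the `CATEGORY_KEYWORDS` table, the sample page, and the URL are all hypothetical names and data chosen for illustration, and the labeling rule (pick the category with the most keyword hits) is an assumed stand-in for whatever rule the article uses. Only the Python standard library is used, and the page is parsed from a string rather than fetched over the network.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, ignoring script/style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def to_record(url, html):
    """Turn unstructured HTML into a structured record (a plain dict)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    body = " ".join(parser.chunks)
    tokens = re.findall(r"[a-z]+", body.lower())
    return {"url": url, "title": parser.title, "text": body, "tokens": tokens}


# Hypothetical keyword table; a real system would derive it from the data.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "score", "league"},
    "technology": {"software", "network", "data", "algorithm"},
}


def label_record(record, category_keywords=CATEGORY_KEYWORDS):
    """Assumed rule: choose the category whose keywords occur most often."""
    counts = Counter(record["tokens"])
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in category_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unlabeled"


SAMPLE_HTML = """<html><head><title>League update</title>
<style>p { color: red; }</style></head>
<body><h1>Weekend results</h1>
<p>The home team won the match with a late score.</p>
<script>var data = "software";</script>
</body></html>"""

record = to_record("https://example.com/news/1", SAMPLE_HTML)
label = label_record(record)
print(record["title"], "->", label)  # prints: League update -> sports
```

The labeled records produced this way would then feed the feature selection and training stages the abstract mentions; those stages are not shown here.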
