An Embedded-Based Weighted Feature Selection Algorithm for Classifying Web Document

With the number of web pages growing exponentially every day, it is increasingly difficult for a search engine to list relevant web pages. In this paper, we propose a machine learning-based classification model that learns the best features of each web page and thereby helps search engines list results. Existing listing methods have several drawbacks, such as interfering with the normal operation of the website and crawling large amounts of useless information. Our proposed algorithm provides an optimal classification for websites that have a large number of web pages, such as Wikipedia, by considering only core information such as link text, side information, and header text. We evaluated our algorithm on standard benchmark datasets, and the results show that it outperforms existing algorithms.
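To make the idea of classifying a page from its core fields concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes each page is a dictionary with the hypothetical keys `link_text`, `header_text`, and `side_info`, uses hand-picked field weights in place of weights learned by an embedded feature selection step, and swaps in a simple nearest-centroid classifier with cosine similarity. All names, weights, and data below are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical per-field weights (assumption): the abstract does not give
# the learned weights, so these are fixed by hand for illustration only.
FIELD_WEIGHTS = {"link_text": 2.0, "header_text": 3.0, "side_info": 1.0}


def weighted_features(page):
    """Merge token counts from each core field into one weighted bag of words."""
    features = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for token in page.get(field, "").lower().split():
            features[token] += weight
    return features


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_pages):
    """Sum the weighted features of all training pages per class label."""
    centroids = {}
    for page, label in labelled_pages:
        centroids.setdefault(label, Counter()).update(weighted_features(page))
    return centroids


def classify(page, centroids):
    """Return the label whose centroid is most similar to the page."""
    features = weighted_features(page)
    return max(centroids, key=lambda label: cosine(features, centroids[label]))


# Toy training data (invented for this sketch, not from any benchmark).
training_pages = [
    ({"link_text": "football league scores",
      "header_text": "sports news",
      "side_info": "match fixtures"}, "sports"),
    ({"link_text": "python tutorial",
      "header_text": "programming guide",
      "side_info": "code examples"}, "tech"),
]
centroids = train_centroids(training_pages)
query = {"header_text": "football news", "link_text": "league table"}
print(classify(query, centroids))  # prints "sports"
```

Because only three short fields are tokenized, the classifier never needs the full page body, which mirrors the abstract's point about avoiding large amounts of useless crawled content. In a real embedded method, the field or term weights would be fitted during model training rather than fixed in advance.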
