Analysis and Implementation of Machine Learning for YouTube Data Classification by Comparing the Performance of Classification Algorithms

Every day, people around the world upload 1.2 million videos to YouTube, or more than 100 hours of video per minute, and this number is increasing. This continuous stream of data is of little value unless it is put to use. Data mining offers a way to extract information from such large-scale data, and classification is one of its techniques. Many YouTube users find that the results of a title search do not match the video category they want. This research was therefore conducted to classify YouTube data based on its search text. This article compares three algorithms for classifying YouTube data into the Kesenian (Arts) and Sains (Science) categories. The data were collected by scraping the YouTube website and consist of links, titles, descriptions, and search queries. The research follows an experimental method comprising data collection, data processing, model proposal, testing, and model evaluation. The models applied are Random Forest, SVM, and Naive Bayes. The results showed that the accuracy of the Random Forest model was 0.004% higher when the label encoder was not applied to the target class, and that the label encoder otherwise had no effect on the accuracy of the classification models. For the data collected in this study, the most appropriate model for YouTube data classification is Naive Bayes, with an accuracy of 88% and an average precision of 90%.
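The workflow summarized above (train a text classifier, then score it by accuracy and average precision, with and without label encoding) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a hand-written multinomial Naive Bayes in plain Python, and the tokenizer, the function names, and the four training titles are hypothetical placeholders chosen only to make the example run. "Average precision" is assumed here to mean the unweighted (macro) mean of per-class precision.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision; a class never predicted scores 0."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        truths = [t for t, p in zip(y_true, y_pred) if p == c]
        per_class.append(sum(t == c for t in truths) / len(truths) if truths else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy data, NOT taken from the study's scraped dataset.
train_texts = [
    "traditional dance performance",
    "painting and music festival",
    "physics experiment explained",
    "chemistry lab science lesson",
]
train_labels = ["Kesenian", "Kesenian", "Sains", "Sains"]
test_texts = ["music and dance", "science experiment"]
test_labels = ["Kesenian", "Sains"]

# Variant 1: string labels used directly as the target class.
model = train_naive_bayes(train_texts, train_labels)
predictions = [predict(model, t) for t in test_texts]

# Variant 2: labels mapped to integers first (what a label encoder does).
encoding = {label: i for i, label in enumerate(sorted(set(train_labels)))}
encoded_model = train_naive_bayes(train_texts, [encoding[l] for l in train_labels])
encoded_predictions = [predict(encoded_model, t) for t in test_texts]
encoded_test_labels = [encoding[l] for l in test_labels]

print(accuracy(test_labels, predictions), macro_precision(test_labels, predictions))
print(accuracy(encoded_test_labels, encoded_predictions))
```

On this toy data both variants yield identical predictions, which matches the intuition that encoding the target class only renames the labels. It says nothing about the 0.004% difference reported for Random Forest, and a real comparison would use a library implementation of all three models on the scraped dataset.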
